In nested resampling, classification accuracy results change wildly - random-forest

Using the mlr package in R, I am creating random forest models. To evaluate the classification accuracy of the model I am using nested resampling, as described here. My problem is that the classification accuracies of the random forest models within the inner loop are usually about 15 percentage points higher than the outer-loop results: I am observing classification accuracies of ~85% within the inner loop, but the accuracy of the outer loop usually ends up around 70%. I cannot provide the data here, but I am pasting the code I am using.
How is that possible? What may be the reason?
rf_param_set <- makeParamSet(
  ParamHelpers::makeDiscreteParam('mtry', values = c(3, 7, 14)),
  ParamHelpers::makeDiscreteParam('ntree', values = c(1000, 2000))
)
rf_tune_ctrl <- makeTuneControlGrid()
rf_inner_resample <- makeResampleDesc('Bootstrap', iters = 5)
acc632plus <- setAggregation(acc, b632plus)
rf_learner <- makeTuneWrapper('classif.randomForest',
                              resampling = rf_inner_resample,
                              measures = list(acc),
                              par.set = rf_param_set,
                              control = rf_tune_ctrl,
                              show.info = TRUE)
# rf_outer_resample <- makeResampleDesc('Subsample', iters = 10, split = 2/3)
rf_outer_resample <- makeResampleDesc('Bootstrap', iters = 10, predict = 'both')
rf_result_resample <- resample(rf_learner, clf_task,
                               resampling = rf_outer_resample,
                               extract = getTuneResult,
                               measures = list(acc, acc632plus),
                               show.info = TRUE)
You can see the resulting output below.
Resampling: OOB bootstrapping
Measures: acc.train acc.test acc.test
[Tune] Started tuning learner classif.randomForest for parameter set:
Type len Def Constr Req Tunable Trafo
mtry discrete - - 3,7,14 - TRUE -
ntree discrete - - 1000,2000 - TRUE -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mtry=3; ntree=1000
[Tune-y] 1: acc.test.mean=0.8415307; time: 0.1 min
[Tune-x] 2: mtry=7; ntree=1000
[Tune-y] 2: acc.test.mean=0.8405726; time: 0.1 min
[Tune-x] 3: mtry=14; ntree=1000
[Tune-y] 3: acc.test.mean=0.8330845; time: 0.1 min
[Tune-x] 4: mtry=3; ntree=2000
[Tune-y] 4: acc.test.mean=0.8415809; time: 0.3 min
[Tune-x] 5: mtry=7; ntree=2000
[Tune-y] 5: acc.test.mean=0.8395083; time: 0.3 min
[Tune-x] 6: mtry=14; ntree=2000
[Tune-y] 6: acc.test.mean=0.8373584; time: 0.3 min
[Tune] Result: mtry=3; ntree=2000 : acc.test.mean=0.8415809
[Resample] iter 1: 0.9961089 0.7434555 0.7434555
[Tune] Started tuning learner classif.randomForest for parameter set:
Type len Def Constr Req Tunable Trafo
mtry discrete - - 3,7,14 - TRUE -
ntree discrete - - 1000,2000 - TRUE -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mtry=3; ntree=1000
[Tune-y] 1: acc.test.mean=0.8479891; time: 0.1 min
[Tune-x] 2: mtry=7; ntree=1000
[Tune-y] 2: acc.test.mean=0.8578465; time: 0.1 min
[Tune-x] 3: mtry=14; ntree=1000
[Tune-y] 3: acc.test.mean=0.8556608; time: 0.1 min
[Tune-x] 4: mtry=3; ntree=2000
[Tune-y] 4: acc.test.mean=0.8502869; time: 0.3 min
[Tune-x] 5: mtry=7; ntree=2000
[Tune-y] 5: acc.test.mean=0.8601446; time: 0.3 min
[Tune-x] 6: mtry=14; ntree=2000
[Tune-y] 6: acc.test.mean=0.8586638; time: 0.3 min
[Tune] Result: mtry=7; ntree=2000 : acc.test.mean=0.8601446
[Resample] iter 2: 0.9980545 0.7032967 0.7032967
[Tune] Started tuning learner classif.randomForest for parameter set:
Type len Def Constr Req Tunable Trafo
mtry discrete - - 3,7,14 - TRUE -
ntree discrete - - 1000,2000 - TRUE -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mtry=3; ntree=1000
[Tune-y] 1: acc.test.mean=0.8772566; time: 0.1 min
[Tune-x] 2: mtry=7; ntree=1000
[Tune-y] 2: acc.test.mean=0.8750990; time: 0.1 min
[Tune-x] 3: mtry=14; ntree=1000
[Tune-y] 3: acc.test.mean=0.8730733; time: 0.1 min
[Tune-x] 4: mtry=3; ntree=2000
[Tune-y] 4: acc.test.mean=0.8782829; time: 0.3 min
[Tune-x] 5: mtry=7; ntree=2000
[Tune-y] 5: acc.test.mean=0.8741619; time: 0.3 min
[Tune-x] 6: mtry=14; ntree=2000
[Tune-y] 6: acc.test.mean=0.8687918; time: 0.3 min
[Tune] Result: mtry=3; ntree=2000 : acc.test.mean=0.8782829
[Resample] iter 3: 0.9902724 0.7329843 0.7329843

What you're seeing is exactly the reason you want to use nested resampling -- the inner resampling loop overfits (to some extent) to the data, and gives a misleading impression of the generalization performance. With the outer resampling in place, you can detect that (accuracy is lower).
The mlr tutorial has a much more detailed page on this (https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html). In general, you're not seeing these results because you're doing anything wrong (unless you split the data manually in a certain way), you're just using a powerful optimization method that optimizes a bit more than it should -- but you're detecting that with the nested resampling.
You could try to use cross-validation instead of bootstrapping; this may provide more consistent results.

Related

How can I improve this Reinforced Learning scenario in Stable Baselines3?

In this scenario, I present a box observation with numbers 0, 1 or 2 and shape (1, 10).
The odds for 0 and 2 are 2% each, and 96% for 1.
I want the model to learn to pick the index of any 2 that appears. If there is no 2, it should just choose action 0.
Below is my code:
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO, DQN, A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

action_length = 10

class TestBot(gym.Env):
    def __init__(self):
        super(TestBot, self).__init__()
        self.total_rewards = 0
        self.time = 0
        self.action_space = spaces.Discrete(action_length)
        self.observation_space = spaces.Box(low=0, high=2, shape=(1, action_length), dtype=np.float32)

    def generate_next_obs(self):
        p = [0.02, 0.02, 0.96]
        a = [0, 2, 1]
        self.observation = np.random.choice(a, size=(1, action_length), p=p)
        if 2 in self.observation[0][1:]:
            self.best_reward += 1

    def reset(self):
        if self.time != 0:
            print('Total rewards: ', self.total_rewards, 'Best possible rewards: ', self.best_reward)
        self.best_reward = 0
        self.time = 0
        self.generate_next_obs()
        self.total_rewards = 0
        self.last_observation = self.observation
        return self.observation

    def step(self, action):
        reward = 0
        if action != 0:
            last_value = self.last_observation[0][action]
            if last_value == 2:
                reward = 1
            else:
                reward = -1
        self.time += 1
        self.generate_next_obs()
        done = self.time == 4096
        info = {}
        self.last_observation = self.observation
        self.total_rewards += reward
        return self.observation, reward, done, info
For training, I used the following:
env = TestBot()
env = make_vec_env(lambda: env, n_envs=1)

model = PPO('MlpPolicy', env, verbose=0)

iters = 0
while True:
    iters += 1
    model.learn(total_timesteps=4096, reset_num_timesteps=True)
PPO gave the best result, but it wasn't great: it learned to achieve positive rewards, but it took a long time and got stuck at a point far from optimal.
How can I improve the learning in this scenario?
I managed to solve my problem by tuning the PPO parameters.
I had to change the following parameters:
gamma: from 0.99 to 0. It determines the importance of future rewards in the decision-making process. A value of 0 means that only immediate rewards are considered.
gae_lambda: from 0.95 to 0.65. The gae_lambda parameter is used in the calculation of Generalized Advantage Estimation (GAE), a method for estimating the advantage function, i.e. how much better a certain action is compared to the average action. A lower value means that PPO relies less on GAE.
clip_range: from 0.2 to a schedule function. It limits how much the policy is allowed to change at each update. Toward the end of training, large policy changes are no longer needed, so I made a function that uses a high clip range in the first few iterations and goes down to 0 at the end.
I also made a small modification to the environment to penalize more heavily the missed opportunity of picking the index of a 2, but that is only there to speed up training (one possible version is sketched below).
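The answer does not show that environment change; the following is only a guess at what it could look like (an assumption on my part, not the author's code), replacing TestBot.step() so that choosing action 0 while a 2 is available is penalized:
# Hypothetical variant of TestBot.step(): penalize skipping when a 2 was available.
def step(self, action):
    reward = 0
    if action != 0:
        last_value = self.last_observation[0][action]
        reward = 1 if last_value == 2 else -1
    elif 2 in self.last_observation[0][1:]:
        reward = -1  # a 2 was on the board but action 0 (do nothing) was chosen
    self.time += 1
    self.generate_next_obs()
    done = self.time == 4096
    info = {}
    self.last_observation = self.observation
    self.total_rewards += reward
    return self.observation, reward, done, info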
The following is my final code:
env = TestBot()
env = make_vec_env(lambda: env, n_envs=1)

iters = 0

def clip_range_schedule():
    def real_clip_range(progress):
        global iters
        cr = 0.2
        if iters > 20:
            cr = 0.0
        elif iters > 12:
            cr = 0.05
        elif iters > 6:
            cr = 0.1
        return cr
    return real_clip_range

model = PPO('MlpPolicy', env, verbose=0, gamma=0.0, gae_lambda=0.65, clip_range=clip_range_schedule())

while True:
    iters += 1
    model.learn(total_timesteps=4096, reset_num_timesteps=True)
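Once you stop the training loop, a quick deterministic rollout is a simple way to sanity-check the learned policy (a usage sketch, not part of the original answer; it reuses the env and model objects from above):
# Greedy rollout of the trained policy for one full episode.
obs = env.reset()
for _ in range(4096):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
# When the episode ends, the VecEnv auto-resets and TestBot.reset() prints
# 'Total rewards' vs 'Best possible rewards' for this greedy run.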

Huge loss value to NaN on regularization and dropout in a deep neural network

I'm taking the Deep Learning course on Udacity. One of the given tasks is to implement regularization and dropout in a multi-layer neural network.
After implementing them, my minibatch loss is insanely high at step 0, changes to infinity at step 1, and then becomes NaN for the rest of the output:
Offset at step 0: 0
Minibatch loss at step 0: 187359330304.000000
Minibatch accuracy: 10.2%
Validation accuracy: 10.0%
Offset at step 1: 128
Minibatch loss at step 1: inf
Minibatch accuracy: 14.1%
Validation accuracy: 10.0%
Offset at step 2: 256
Minibatch loss at step 2: nan
Minibatch accuracy: 7.8%
Validation accuracy: 10.0%
Offset at step 3: 384
Minibatch loss at step 3: nan
Minibatch accuracy: 11.7%
Validation accuracy: 10.0%
Here is all the relevant code. I'm confident the problem has nothing to do with the way I've done my optimization (since that is taken from the given task) or with my regularization, so I'm not sure where else it could be. I've also played around with the number of nodes in the hidden layers (1024 > 300 > 60), but it does the same thing.
Here is my code:
batch_size = 128
num_nodes_1 = 768
num_nodes_2 = 1024
num_nodes_3 = 512
dropout_value = 0.5
beta = 0.01

graph = tf.Graph()
with graph.as_default():
    tf_train_data = tf.placeholder(tf.float32, shape=(batch_size, image_size*image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_data = tf.constant(valid_dataset)
    tf_test_data = tf.constant(test_dataset)

    def gen_weights_biases(input_size, output_size):
        weights = tf.Variable(tf.truncated_normal([input_size, output_size]))
        biases = tf.Variable(tf.zeros([output_size]))
        return weights, biases

    weights_1, biases_1 = gen_weights_biases(image_size*image_size, num_nodes_1)
    weights_2, biases_2 = gen_weights_biases(num_nodes_1, num_nodes_2)
    weights_3, biases_3 = gen_weights_biases(num_nodes_2, num_nodes_3)
    weights_4, biases_4 = gen_weights_biases(num_nodes_3, num_labels)

    logits_1 = tf.matmul(tf_train_data, weights_1) + biases_1
    h_layer_1 = tf.nn.relu(logits_1)
    h_layer_1 = tf.nn.dropout(h_layer_1, dropout_value)

    logits_2 = tf.matmul(h_layer_1, weights_2) + biases_2
    h_layer_2 = tf.nn.relu(logits_2)
    h_layer_2 = tf.nn.dropout(h_layer_2, dropout_value)

    logits_3 = tf.matmul(h_layer_2, weights_3) + biases_3
    h_layer_3 = tf.nn.relu(logits_3)
    h_layer_3 = tf.nn.dropout(h_layer_3, dropout_value)

    logits_4 = tf.matmul(h_layer_3, weights_4) + biases_4

    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_4))
    regularization = tf.nn.l2_loss(logits_1) + tf.nn.l2_loss(logits_2) + tf.nn.l2_loss(logits_3) + tf.nn.l2_loss(logits_4)
    reg_loss = tf.reduce_mean(loss + regularization * beta)

    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(0.5, global_step, 750, 0.8)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(reg_loss, global_step=global_step)

    train_prediction = tf.nn.softmax(logits_4)

    def make_prediction(input_data):
        p_logits_1 = tf.matmul(input_data, weights_1) + biases_1
        p_layer_1 = tf.nn.relu(p_logits_1)
        p_logits_2 = tf.matmul(p_layer_1, weights_2) + biases_2
        p_layer_2 = tf.nn.relu(p_logits_2)
        p_logits_3 = tf.matmul(p_layer_2, weights_3) + biases_3
        p_layer_3 = tf.nn.relu(p_logits_3)
        p_logits_4 = tf.matmul(p_layer_3, weights_4) + biases_4
        return tf.nn.relu(p_logits_4)

    valid_prediction = make_prediction(tf_valid_data)
    test_prediction = make_prediction(tf_test_data)
num_steps = 10001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized \n")
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_data: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run([optimizer, reg_loss, train_prediction], feed_dict=feed_dict)
        if (step % 1 == 0):
            print("Offset at step %d: %d" % (step, offset))
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%% \n" % accuracy(valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
Why is this happening, and how do I fix it?
The problem was the standard deviation for the weights. I'm not sure why this sorted it out and if someone could explain I'd appreciate it. Anyway the fix was:
def gen_weights_biases(input_size, output_size):
    weights = tf.Variable(tf.truncated_normal([input_size, output_size], stddev=math.sqrt(2.0/input_size)))
    biases = tf.Variable(tf.zeros([output_size]))
    return weights, biases
The beta rate also had to be lowered to 0.0001.
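A likely reason, sketched roughly here (my own explanation, not from the course): with the default stddev of 1.0, each ReLU layer scales the typical activation by roughly the square root of its number of inputs, so after four layers the logits, and with them the L2 term and the cross-entropy loss, become enormous; scaling the weights by sqrt(2/input_size) (He initialization) keeps the activations at order 1. A small numpy illustration of that effect:
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 784))  # stand-in for one minibatch of flattened images

def mean_activation(stddev_fn, sizes=(784, 768, 1024, 512, 10)):
    # Push the batch through ReLU layers whose weights use the given stddev rule.
    h = x
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = rng.normal(0.0, stddev_fn(n_in), size=(n_in, n_out))
        h = np.maximum(h @ w, 0.0)  # ReLU
    return h.mean()

print(mean_activation(lambda n: 1.0))               # explodes into the tens of thousands
print(mean_activation(lambda n: np.sqrt(2.0 / n)))  # stays at order 1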

machine learning using R and randomForestSRC package

I'm trying to use "surv.randomForestSRC" as the learner for machine learning in R.
My code and results are below. "newHCC" is the survival data of HCC patients, with multiple numeric parameters.
> newHCC$status = (newHCC$status == 1)
> surv.task = makeSurvTask(data = newHCC, target = c("time", "status"))
> surv.task
Supervised task: newHCC
Type: surv
Target: time,status
Events: 61
Observations: 127
Features:
numerics factors ordered
30 0 0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
> lrn = makeLearner("surv.randomForestSRC")
> rdesc = makeResampleDesc(method = "RepCV", folds=10, reps=10)
> r = resample(learner = lrn, task = surv.task, resampling = rdesc)
[Resample] repeated cross-validation iter 1: cindex.test.mean=0.485
[Resample] repeated cross-validation iter 2: cindex.test.mean=0.556
[Resample] repeated cross-validation iter 3: cindex.test.mean=0.825
[Resample] repeated cross-validation iter 4: cindex.test.mean=0.81
...
[Resample] repeated cross-validation iter 100: cindex.test.mean=0.683
[Resample] Aggr. Result: cindex.test.mean=0.688
I have several questions.
How can I check the parameters that were used, like ntree, mtry and so on?
Is there any good way to tune them?
How can I see the predicted individual risk, like what we get when we use predict with the randomForestSRC package?
Many thanks in advance.
For questions 1 and 2, you can try the following:
surv_param <- makeParamSet(
  makeIntegerParam("ntree", lower = 50, upper = 100),
  makeIntegerParam("mtry", lower = 1, upper = 6),
  makeIntegerParam("nodesize", lower = 10, upper = 50),
  makeIntegerParam("nsplit", lower = 3, upper = 50)
)
rancontrol <- makeTuneControlRandom(maxit = 10L)
surv_tune <- tuneParams(learner = lrn, resampling = rdesc, task = surv.task,
                        par.set = surv_param, control = rancontrol)
surv.tree <- setHyperPars(lrn, par.vals = surv_tune$x)
surv <- mlr::train(surv.tree, surv.task)
getLearnerModel(surv)
model <- predict(surv, surv.task)
For question 3: as of today you cannot predict individual risk with surv.randomForestSRC in mlr; the only available predict type is "response".

The return values of the train_on_batchs in Keras

I am building an RNN in Keras.
def RNN_keras(max_timestep_len, feat_num):
    model = Sequential()
    model.add(Masking(mask_value=-1.0, input_shape=(max_timestep_len, feat_num)))
    model.add(SimpleRNN(input_dim=feat_num, output_dim=128, activation='relu', return_sequences=True))
    model.add(Dropout(0.2))
    model.add(TimeDistributed(Dense(output_dim=1, activation='relu')))

    sgd = SGD(lr=0.1, decay=1e-6)
    model.compile(loss='mean_squared_error',
                  optimizer=sgd,
                  metrics=['mean_squared_error'])
    return model

for epoch in range(1, NUM_EPOCH+1):
    batch_index = 0
    for X_batch, y_batch in mig.Xy_gen(mig.X_train, mig.y_train, batch_size=BATCH_SIZE):
        batch_index += 1
        X_train_pad = sequence.pad_sequences(X_batch, maxlen=mig.ttb.MAX_SEQ_LEN, padding='pre', value=-1.0)
        y_train_pad = sequence.pad_sequences(y_batch, maxlen=mig.ttb.MAX_SEQ_LEN, padding='pre', value=-1.0)
        loss = rnn.train_on_batch(X_train_pad, y_train_pad)
        print("Epoch", epoch, ": Batch", batch_index, "-",
              rnn.metrics_names[0], "=", loss[0], "-", rnn.metrics_names[1], "=", loss[1])
The output:
Epoch 1 : Batch 1 - loss = 715.478 - mean_squared_error = 178.191
Epoch 1 : Batch 2 - loss = 1.32964e+12 - mean_squared_error = 2.7457e+11
Epoch 1 : Batch 3 - loss = 2880.08 - mean_squared_error = 594.089
Epoch 1 : Batch 4 - loss = 4065.16 - mean_squared_error = 1031.27
Epoch 1 : Batch 5 - loss = 3489.96 - mean_squared_error = 695.302
Epoch 1 : Batch 6 - loss = 546.395 - mean_squared_error = 147.439
Epoch 1 : Batch 7 - loss = 1353.35 - mean_squared_error = 241.043
Epoch 1 : Batch 8 - loss = 1962.75 - mean_squared_error = 426.699
Epoch 1 : Batch 9 - loss = 2680.85 - mean_squared_error = 504.812
My questions:
Is it normal that the loss does not decrease from batch to batch?
I set both the loss and the metric to 'mean_squared_error'. Why are the reported loss and mean_squared_error different? Are they calculated on different subsets of the training data?
How should I decide whether I need to use 'pre' padding or 'post' padding? 'Pre' is like adding 'START', while 'post' is like adding 'END'. But based on my understanding, both 'START' and 'END' are important in sequence labeling. Right?
In the TimeDistributed layer, is y_t also determined by y_t-1, y_t-2, ...? Or is it just a sequence version of the Dense layer, where the outputs at all time steps are independent?

How to interpret this triangular shape ROC AUC curve?

I have 10+ features and a dozen thousand cases to train a logistic regression for classifying people's race. The first example is French vs. non-French, and the second example is English vs. non-English. The results are as follows:
//////////////////////////////////////////////////////
1= fr
0= non-fr
Class count:
0 69109
1 30891
dtype: int64
Accuracy: 0.95126
Classification report:
precision recall f1-score support
0 0.97 0.96 0.96 34547
1 0.92 0.93 0.92 15453
avg / total 0.95 0.95 0.95 50000
Confusion matrix:
[[33229 1318]
[ 1119 14334]]
AUC= 0.944717975754
//////////////////////////////////////////////////////
1= en
0= non-en
Class count:
0 76125
1 23875
dtype: int64
Accuracy: 0.7675
Classification report:
precision recall f1-score support
0 0.91 0.78 0.84 38245
1 0.50 0.74 0.60 11755
avg / total 0.81 0.77 0.78 50000
Confusion matrix:
[[29677 8568]
[ 3057 8698]]
AUC= 0.757955582999
//////////////////////////////////////////////////////
However, I am getting some very strange-looking ROC curves with triangular shapes instead of the usual jagged, rounded curves. Any explanation as to why I am getting such a shape? Any possible mistake I have made?
Code:
all_dict = []
for i in range(0, len(my_dict)):
    temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
                     + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
                     + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
                     + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items()
                     )
    all_dict.append(temp_dict)
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
#$$
lr.fit(X_train, y_train)
# Making predictions using trained data
#$$
y_train_predictions = lr.predict(X_train)
#$$
y_test_predictions = lr.predict(X_test)
#print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
print 'Classification report:'
print classification_report(y_test, y_test_predictions)
#print sk_confusion_matrix(y_train, y_train_predictions)
print 'Confusion matrix:'
print sk_confusion_matrix(y_test, y_test_predictions)
#print y_test[1:20]
#print y_test_predictions[1:20]
#print y_test[1:10]
#print np.bincount(y_test)
#print np.bincount(y_test_predictions)
# Find and plot AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print 'AUC=',roc_auc
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
You're doing it wrong. According to the documentation:
y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class or confidence values.
Thus at this line:
roc_curve(y_test, y_test_predictions)
You should pass the result of decision_function (or one of the two columns of the predict_proba result) into the roc_curve function, instead of the hard class predictions.
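For example, a minimal sketch reusing the variables from the question's code (it assumes lr is the fitted LogisticRegression):
from sklearn.metrics import roc_curve, auc

# Use continuous scores for the positive class, not the 0/1 predictions.
y_test_scores = lr.decision_function(X_test)
# equivalently: y_test_scores = lr.predict_proba(X_test)[:, 1]

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_scores)
roc_auc = auc(false_positive_rate, true_positive_rate)  # then plot exactly as before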
Look at these examples http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
