Is a validation set essential in a machine learning research paper? - machine-learning

I know that a validation set is very important because, in brief, it lets us check our model and tune hyperparameters without touching the test set. However, some machine learning papers do not create a validation set. For example, in Neural Collaborative Filtering the authors mention that they use a validation set to tune hyperparameters, but in the provided code they simply train the model on the train set and evaluate it on the test set at each epoch.
for epoch in xrange(epochs):
    t1 = time()
    # Generate training instances
    user_input, item_input, labels = get_train_instances(train, num_negatives)
    # Training
    hist = model.fit([np.array(user_input), np.array(item_input)], #input
                     np.array(labels), # labels
                     batch_size=batch_size, nb_epoch=1, verbose=0, shuffle=True)
    t2 = time()
    # Evaluation
    if epoch %verbose == 0:
        (hits, ndcgs) = evaluate_model(model, testRatings, testNegatives, topK, evaluation_threads)
        hr, ndcg, loss = np.array(hits).mean(), np.array(ndcgs).mean(), hist.history['loss'][0]
        print('Iteration %d [%.1f s]: HR = %.4f, NDCG = %.4f, loss = %.4f [%.1f s]'
              % (epoch, t2-t1, hr, ndcg, loss, time()-t2))
        if hr > best_hr:
            best_hr, best_ndcg, best_iter = hr, ndcg, epoch
            if args.out > 0:
                model.save_weights(model_out_file, overwrite=True)
I couldn't find any code that creates or handles a validation set in that GitHub repository.
Is it okay to use the test set in place of a validation set in machine learning research in general? In my opinion, we should at least use a validation set for competitions like Kaggle or for real-life systems.
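For reference, the separation I have in mind looks roughly like the sketch below. It is only an illustration: `three_way_split`, `select_best_epoch` and the per-epoch scores are hypothetical and are not taken from the paper's repository.

```python
import random

def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out disjoint train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def select_best_epoch(val_scores):
    """Pick the epoch with the best validation score (higher is better)."""
    return max(range(len(val_scores)), key=lambda e: val_scores[e])

train, val, test = three_way_split(range(100))
# Hypothetical per-epoch validation hit ratios, for illustration only.
val_hr = [0.41, 0.55, 0.62, 0.60, 0.58]
best_epoch = select_best_epoch(val_hr)
# The test set would be evaluated once, using the checkpoint from best_epoch.
```

In the quoted loop, by contrast, `best_hr` is chosen by looking at `testRatings`, so the reported number is selected on the test set itself.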

Related

Discriminator's loss stuck at value = 1 while training conditional GAN

I am training a conditional GAN that generates image time series (similar to video prediction). I built the conditional GAN based on this paper. However, several problems occurred while I was training the cGAN.
Problems with training the cGAN:
The discriminator's loss is stuck at one.
The generator's loss does not seem to be affected by the discriminator, no matter how I adjust the hyperparameters related to the discriminator.
Training loss of discriminator
D_loss = (fake_D_loss + true_D_loss) / 2
fake_D_loss = Hinge_loss(D(G(x, z)))
true_D_loss = Hinge_loss(D(x, y))
The margin of hinge loss = 1
Training loss of generator
GD_loss = -torch.mean(D(G(x, z)))
G_loss = weighted MAE
[Figure: Gradient flow of discriminator]
[Figure: Gradient flow of generator]
Several settings of the cGAN:
The output layer of discriminator is linear sum.
The discriminator is trained twice per epoch while the generator is only trained once.
The number of neurons of the generator and discriminator are exactly the same as the paper.
I replaced the ReLU (original setting) with LeakyReLU to avoid nan.
I added gradient norm to avoid the vanishing gradient problem.
Other hyper parameters are listed as follows:
Hyperparameter               Paper   Mine
number of input images       4       4
number of predicted images   18      10
batch size                   16      16
opt_g, opt_d                 Adam    Adam
lr_g                         5e-5    5e-5
lr_d                         2e-4    2e-4
The loss function I use for discriminator.
def HingeLoss(pred, validity, margin=1.):
    if validity:
        loss = F.relu(margin - pred)
    else:
        loss = F.relu(margin + pred)
    return loss.mean()
The loss function for examining the validity of predicted image from generator.
def HingeLossG(pred):
    return -torch.mean(pred)
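As a quick sanity check on the arithmetic of the hinge loss above, here is a scalar sketch in plain Python (no tensors; `hinge_d_loss` is a hypothetical helper, not part of the model code). It only shows which discriminator outputs give an averaged loss of exactly 1 with margin 1; it does not by itself explain why training ends up there.

```python
def relu(v):
    return max(0.0, v)

def hinge_d_loss(d_real, d_fake, margin=1.0):
    """Average of the real and fake hinge terms, for single scalar scores."""
    true_loss = relu(margin - d_real)
    fake_loss = relu(margin + d_fake)
    return (true_loss + fake_loss) / 2

# A discriminator that outputs 0 for everything pays exactly the margin.
print(hinge_d_loss(0.0, 0.0))    # 1.0
# A discriminator that separates real and fake by the margin pays nothing.
print(hinge_d_loss(1.5, -1.5))   # 0.0
```

More generally, if the discriminator gives the same score c (with |c| <= 1) to real and fake inputs, the two terms are 1 - c and 1 + c, which average to 1.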
I use the trainer of pytorch_lightning to train the model. The training codes I wrote are as follows.
def training_step(self, batch, batch_idx, optimizer_idx):
    x, y = batch
    x.requires_grad = True
    if self.n_sample > 1:
        pred = [self(x) for _ in range(self.n_sample)]
        pred = torch.mean(torch.stack(pred, dim=0), dim=0)
    else:
        pred = self(x)
    ##### TRAIN DISCRIMINATOR #####
    if optimizer_idx == 1:
        true_D_loss = self.discriminator_loss(self.discriminator(x, y), True)
        fake_D_loss = self.discriminator_loss(self.discriminator(x, pred.detach()), False)
        D_loss = (fake_D_loss + true_D_loss) / 2
        return D_loss
    ##### TRAIN GENERATOR #####
    if optimizer_idx == 0:
        G_loss = self.generator_loss(pred, y)
        GD_loss = self.generator_d_loss(self.discriminator(x, pred.detach()))
        train_G_loss = G_loss + GD_loss
        return train_G_loss
I have several guesses about why these problems may occur:
Since the original model predicts 18 frames rather than 10 (my version), the original generator may have too many neurons for my case, producing an overly powerful generator that breaks the balance of training. However, I have tried lowering the generator's learning rate to 1e-5 (originally 5e-5) and increasing the number of discriminator updates to 3 to 5. The generator's loss curve did not change much.
[Figure: Various results of training cGAN]
I have also adjusted the weights of the generator's loss, but the same problems still occurred.
The architecture codes of this model: https://github.com/hyungting/DGMR-pytorch

What is the standard way to train a PyTorch script until convergence?

What is the standard way to detect whether a model has converged? I was going to record 5 losses, each with a 95% confidence interval, and halt the script if they all agreed. I assume training until convergence must already be implemented somewhere in PyTorch or PyTorch Lightning. I don't need a perfect solution, just the standard way to do this automatically, i.e. halt when converged.
My solution is easy to implement. Create a criterion and change its reduction to none, so that it outputs a tensor of size [B]. Every time you log, record the loss and its 95% confidence interval (or the std if you prefer, but that is much less accurate). Each time you add a new loss with its confidence interval, keep the record at size 5 (or 10) and check whether the 5 losses are within a 95% CI of each other. If they are, halt.
You can compute the CI with this:
import scipy.stats
from torch import Tensor

def torch_compute_confidence_interval(data: Tensor,
                                      confidence: float = 0.95
                                      ) -> tuple[Tensor, Tensor]:
    """
    Computes the confidence interval for a given survey of a data set.
    Returns the mean and the half-width of the interval.
    """
    n = len(data)
    mean: Tensor = data.mean()
    # se: Tensor = scipy.stats.sem(data)  # compute standard error
    # se, mean: Tensor = torch.std_mean(data, unbiased=True)  # compute standard error
    se: Tensor = data.std(unbiased=True) / (n**0.5)
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci
and you can create the criterion as follows:
loss: nn.Module = nn.CrossEntropyLoss(reduction='none')
so the train loss is now of size [B].
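The halting rule itself could be sketched as below. This is only one possible reading of "within a 95% CI of each other", namely that all recorded intervals overlap; `ConvergenceMonitor` and `intervals_agree` are hypothetical names, and the inputs are the (mean, half-width) pairs returned by the function above, converted to floats.

```python
from collections import deque

def intervals_agree(window):
    """True if every (mean, half_width) interval overlaps every other one."""
    lows = [m - h for m, h in window]
    highs = [m + h for m, h in window]
    # In 1-D, all intervals overlap pairwise exactly when the largest lower
    # bound does not exceed the smallest upper bound.
    return max(lows) <= min(highs)

class ConvergenceMonitor:
    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)

    def update(self, mean, half_width):
        """Record one logged loss; return True when training should halt."""
        self.window.append((mean, half_width))
        full = len(self.window) == self.window.maxlen
        return full and intervals_agree(self.window)

# Made-up logged losses, for illustration only.
monitor = ConvergenceMonitor(window_size=3)
logged = [(2.0, 0.1), (1.2, 0.1), (0.9, 0.1), (0.85, 0.1), (0.88, 0.1)]
stops = [monitor.update(m, h) for m, h in logged]
```

With these made-up numbers the monitor only fires on the last entry, once the three most recent intervals overlap.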
Note that I know how to train for a fixed number of epochs, so I am not really looking for that. I just want a halting criterion for when the model looks converged: roughly what a person would do when looking at the learning curve, but automated.
ref:
https://forums.pytorchlightning.ai/t/what-is-the-standard-way-to-halt-a-script-when-it-has-converged/1415
Set an EarlyStopping (https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.EarlyStopping.html#pytorch_lightning.callbacks.EarlyStopping) callback in your trainer:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

callbacks = [
    EarlyStopping(
        monitor="val_f1_score",
        min_delta=0.01,
        patience=10,  # NOTE no. val epochs, not train epochs
        verbose=False,
        mode="max",  # an F1 score is better when it is higher
    ),
]
trainer = pl.Trainer(callbacks=callbacks)
This will monitor changes in val_f1_score during training (note that you have to log this value with self.log("val_f1_score", val_f1) in your pl.LightningModule). Training stops once the monitored value has failed to improve by at least min_delta (the minimum change that qualifies as an improvement) for patience consecutive validation checks.
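To make the roles of patience and min_delta concrete, here is a simplified plain-Python sketch of that kind of rule. It is not Lightning's actual implementation; `PatienceStopper` and the F1 values are hypothetical.

```python
class PatienceStopper:
    """Stop when the monitored metric has not improved by more than
    min_delta for `patience` consecutive validation checks."""

    def __init__(self, patience=3, min_delta=0.0, mode="max"):
        self.patience = patience
        self.min_delta = min_delta
        self.sign = 1.0 if mode == "max" else -1.0
        self.best = None
        self.bad_checks = 0

    def should_stop(self, value):
        score = self.sign * value  # flip the sign so that higher is always better
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Made-up validation F1 scores, for illustration only.
stopper = PatienceStopper(patience=3, min_delta=0.01, mode="max")
val_f1 = [0.50, 0.60, 0.605, 0.60, 0.608]
decisions = [stopper.should_stop(v) for v in val_f1]
```

Here the jump from 0.50 to 0.60 counts as an improvement, while the three later values stay within min_delta of 0.60, so the rule fires on the fifth check.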

Validation accuracy fluctuating while training accuracy increase?

I have a multiclass classification problem that depends on historical data. I am trying an LSTM with loss='sparse_categorical_crossentropy'. The training accuracy increases and the training loss decreases, as expected, but my test accuracy starts to fluctuate wildly.
What am I doing wrong?
Input data:
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
X.shape
(200146, 13, 1)
My model
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=False, random_state=seed)
cvscores = []
for train, test in kfold.split(X, y):
    regressor = Sequential()
    # Units = the number of LSTM units that we want to have in this first layer -> we want very high dimensionality, we need a high number
    # return_sequences = True because we are adding another layer after this
    # input shape = the last two dimensions and the indicator
    regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X[train].shape[1], 1)))
    regressor.add(Dropout(0.2))
    # Extra LSTM layer
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(0.2))
    # 3rd
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(0.2))
    # 4th
    regressor.add(LSTM(units=50))
    regressor.add(Dropout(0.2))
    # output layer
    regressor.add(Dense(4, activation='softmax', kernel_regularizer=regularizers.l2(0.001)))
    # Compile the RNN
    regressor.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # Set callback functions to early stop training and save the best model so far
    callbacks = [EarlyStopping(monitor='val_loss', patience=9),
                 ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
    history = regressor.fit(X[train], y[train], epochs=250, callbacks=callbacks,
                            validation_data=(X[test], y[test]))
    # plot train and validation loss
    pyplot.plot(history.history['loss'])
    pyplot.plot(history.history['val_loss'])
    pyplot.title('model train vs validation loss')
    pyplot.ylabel('loss')
    pyplot.xlabel('epoch')
    pyplot.legend(['train', 'validation'], loc='upper right')
    pyplot.show()
    # evaluate the model
    scores = regressor.evaluate(X[test], y[test], verbose=0)
    print("%s: %.2f%%" % (regressor.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
Results:
[Image: training output]
[Image: plot of train vs validation loss]
What you are describing here is overfitting. This means your model keeps learning about your training data and doesn't generalize; in other words, it is learning the exact features of your training set. This is one of the main problems you have to deal with in deep learning. There is no single solution; you have to try out different architectures, different hyperparameters and so on.
You can start with a small model that underfits (that is, both training and validation accuracy are low) and keep increasing its size until it overfits. Then you can play around with the optimizer and other hyperparameters.
By smaller model I mean one with fewer hidden units or fewer layers.
You seem to have too many LSTM layers stacked on top of each other, which eventually leads to overfitting. You should probably decrease the number of layers.
Your model seems to be overfitting, since the training error keeps decreasing while the validation error does not. Overall, it fails to generalize.
You should try reducing the model complexity by removing some of the LSTM layers. Also, try varying the batch size; this can reduce the fluctuations in the loss.
You can also consider varying the learning rate.

Cross Validation in Keras

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold

def load_data():
    # load your data using this function
    pass

def create_model():
    # create your model using this function
    pass

def train_evaluate(model, x_train, y_train, x_test, y_test):
    # fit and evaluate here.
    pass

if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies on neural networks, I learned that a network's knowledge is represented in its synaptic weights, and that during training the weights are updated to reduce the network's error rate and improve its performance. (In my case, I'm using supervised learning.)
For better training and assessment of neural network performance, a commonly used method is cross-validation, which partitions the data set for training and evaluation of the model.
My question is...
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Do we define, train and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is this piece of code written like this?
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not like this?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or is it my theory?
"If my goal is to fine-tune the network for the entire dataset"
It is not clear what you mean by "fine-tune", or even what exactly is your purpose for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2, etc.), which is essentially just training on your whole data set, and it is certainly not cross-validation...
That said, a final step after the CV which is often only implied (and frequently missed by beginners) is that, once you are satisfied with your chosen hyperparameters and/or model performance as given by your CV procedure, you go back and train your model again, this time with the entire available data.
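As a rough illustration of both points (a fresh model per fold, then a final refit on all data), here is a minimal plain-Python sketch. `MeanModel` is a toy stand-in for your network and `k_fold_indices` is a hypothetical helper, not scikit-learn's splitter; the data are made up.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, non-shuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

class MeanModel:
    """Toy stand-in for a network: predicts the mean of the training targets."""
    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        return self

    def mse(self, y):
        return sum((v - self.mean_) ** 2 for v in y) / len(y)

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_scores = []
for train_idx, test_idx in k_fold_indices(len(y), n_splits=3):
    model = MeanModel()                      # fresh model for every fold
    model.fit([y[i] for i in train_idx])     # it never sees this fold's test data
    fold_scores.append(model.mse([y[i] for i in test_idx]))

cv_estimate = sum(fold_scores) / len(fold_scores)   # performance estimate
final_model = MeanModel().fit(y)                    # refit on all data after CV
```

If a single model were reused across folds instead, by the second fold it would already have been fitted on samples that now sit in the test partition, so the fold scores would no longer measure performance on unseen data.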
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import *
from tensorflow.keras.layers import *
# Note: this wrapper module ships with older TensorFlow releases; newer ones
# point to the separate SciKeras package instead
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model

estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 cpus
results.mean()  # mean score over all folds (the Keras wrapper is expected to report the negated loss, i.e. -MSE)
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking", or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this might be problematic is if you are running something like PCA or ICA for feature extraction. If doing something like this, you must be sure to run PCA on your training set, and then apply the transformation matrix from the training set to the test set.
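A minimal sketch of that fit-on-train, apply-to-test pattern follows. It uses a simple standardizer as a stand-in for PCA/ICA, since the same rule applies to any transform with fitted parameters; `Standardizer` and the data are hypothetical.

```python
from statistics import mean, pstdev

class Standardizer:
    """Toy stand-in for PCA/ICA: any transform with fitted parameters."""
    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0   # guard against zero variance
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 11.0]

scaler = Standardizer().fit(train)   # parameters are estimated from train only
train_z = scaler.transform(train)
test_z = scaler.transform(test)      # the same parameters are reused on test
```

Inside cross-validation, the fit step would be repeated on each fold's training partition rather than done once on the full data set.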
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data not used during the training process, in order to simulate the arrival of new data.
So basically, the data you finally test your model on should mimic the first portion of data you'll get from your client/application to apply your model to.
That's why cross-validation is so powerful: it lets every data point in your whole dataset be used as a simulation of new data.
And now, to answer your question, every cross-validation should follow this pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because after all, you'll first train your model and then use it on new data. In your second approach, you cannot treat it as a mimicry of a training process because, e.g., in the second fold your model would keep information from the first fold, which is not equivalent to your training procedure.
Of course, you could apply a training procedure that uses 10 folds of consecutive training in order to fine-tune the network. But this is then not cross-validation; you would need to evaluate this procedure using some kind of scheme like the one above.
The commented out functions make this a little less obvious, but the idea is to keep track of your model performance as you iterate through your folds and at the end provide either those lower level performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
import numpy as np

def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
idx = 0
for train, test in kFold.split(X, Y):
    model = create_model()  # a new, untrained model for every fold
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
    idx += 1
print(scores)
print(scores.mean())
So yes you do want to create a new model for each fold as the purpose of this exercise is to determine how your model as it is designed performs on all segments of the data, not just one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train a model with varying hyperparameters using the cross validation splits and keep track of the performance on splits and overall. In the end you will be able to get a much better idea of which hyperparameters allow the model to perform best. For a much more in depth explanation see sklearn Model Selection and pay particular attention to the sections of Cross Validation and Grid Search.
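As a rough sketch of combining a grid search with cross-validation, here is a dependency-free toy example. The k-nearest-neighbour regressor, the interleaved folds and the data are all hypothetical choices made for illustration; in practice scikit-learn's model selection tools do this bookkeeping for you.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_mse(xs, ys, k, n_splits):
    """Mean squared error of k-NN, averaged over interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(n_splits):
        test_idx = [i for i in range(n) if i % n_splits == fold]
        train_idx = [i for i in range(n) if i % n_splits != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        # The "model" is rebuilt from this fold's training data only.
        sq_err = [(knn_predict(train_x, train_y, xs[i], k) - ys[i]) ** 2
                  for i in test_idx]
        fold_errors.append(sum(sq_err) / len(sq_err))
    return sum(fold_errors) / n_splits

xs = [float(i) for i in range(12)]
ys = [2.0 * x for x in xs]          # toy targets, for illustration only
grid = [1, 3, 5]                    # candidate values of the hyperparameter k
scores = {k: cv_mse(xs, ys, k, n_splits=3) for k in grid}
best_k = min(scores, key=scores.get)
```

Each candidate gets its own full cross-validation run, and the value with the lowest averaged error is kept; the final model would then be refit on all the data with `best_k`.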

How is the generator trained with the output of the discriminator in Generative Adversarial Networks?

Recently I have learned about Generative Adversarial Networks.
I am somewhat confused about how the Generator learns during training. Here is an implementation of GANs:
`# train generator
z = Variable(xp.random.uniform(-1, 1, (batchsize, nz), dtype=np.float32))
x = gen(z)
yl = dis(x)
L_gen = F.softmax_cross_entropy(yl, Variable(xp.zeros(batchsize, dtype=np.int32)))
L_dis = F.softmax_cross_entropy(yl, Variable(xp.ones(batchsize, dtype=np.int32)))
# train discriminator
x2 = Variable(cuda.to_gpu(x2))
yl2 = dis(x2)
L_dis += F.softmax_cross_entropy(yl2, Variable(xp.zeros(batchsize, dtype=np.int32)))
#print "forward done"
o_gen.zero_grads()
L_gen.backward()
o_gen.update()
o_dis.zero_grads()
L_dis.backward()
o_dis.update()`
So it computes a loss for the Generator as mentioned in the paper.
However, it calls the Generator's backward function based on the Discriminator output. The Discriminator output is just a number (not an array).
But we know that in general, to train a network, we compute a loss function at the last layer (a loss between the last layer's output and the real output) and then compute the gradients. For example, if the output is 64*64, we compare it with a 64*64 image, compute the loss and do the backpropagation.
However, in the code I see for Generative Adversarial Networks, the loss for the Generator is computed from the Discriminator output (which is just a number) and then backpropagation is called for the Generator. The Generator's last layer is, for example, 64*64 pixels, but the Discriminator loss is 1*1 (which is different from the usual networks). So I do not understand how this causes the Generator to learn.
I thought that if we attach the two networks (the Generator and the Discriminator), call backpropagation, and update only the Generator's parameters, it would make sense and should work. But what I see in the code is totally different.
So I am asking: how is this possible?
Thanks
You say 'However, it calls the Generator backward function based on the Discriminator output. The discriminator output is just a number (not an array)', but the loss is always a scalar value. When we compute the mean squared error of two images, it is also a scalar value.
L_adversarial = E[log(D(x))] + E[log(1−D(G(z)))]
x is drawn from the real data distribution
z is drawn from the latent distribution and is transformed by the Generator
Coming back to your actual question: the Discriminator network has a sigmoid activation function in the last layer, which means its output is in the range [0,1]. The Discriminator tries to maximize this loss by maximizing both terms that are added in the loss function. The maximum value of the first term is 0 and occurs when D(x) is 1; the maximum value of the second term is also 0 and occurs when 1-D(G(z)) is 1, which means D(G(z)) is 0. So the Discriminator performs binary classification by maximizing this loss function: it tries to output 1 when it is fed x (real data) and 0 when it is fed G(z) (generated fake data).
The Generator, on the other hand, tries to minimize this loss; in other words, it tries to fool the Discriminator by generating fake samples that are similar to real samples. Over time, both the Generator and the Discriminator get better and better. This is the intuition behind GANs.
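To see how a scalar loss computed on the Discriminator output still yields a gradient for the Generator's weights, here is a tiny hand-written chain-rule sketch. It assumes a one-weight "generator" G(z) = w*z, a one-weight "discriminator" D(x) = sigmoid(a*x), and the generator loss -log D(G(z)); all names are hypothetical and the setup is deliberately minimal.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def generator_loss(w, a, z):
    """-log D(G(z)) for a one-weight generator and one-weight discriminator."""
    x = w * z                 # generator output G(z)
    d = sigmoid(a * x)        # discriminator output D(G(z)), a scalar in (0, 1)
    return -math.log(d)

def generator_grad(w, a, z):
    """dL/dw by the chain rule: dL/dd * dd/dx * dx/dw."""
    x = w * z
    d = sigmoid(a * x)
    dL_dd = -1.0 / d
    dd_dx = a * d * (1.0 - d)   # the gradient passes through D to G's output
    dx_dw = z
    return dL_dd * dd_dx * dx_dw

w, a, z = 0.5, 2.0, 1.5
analytic = generator_grad(w, a, z)
# Cross-check the chain rule against a central finite difference.
eps = 1e-6
numeric = (generator_loss(w + eps, a, z) - generator_loss(w - eps, a, z)) / (2 * eps)
```

The middle factor, dd/dx, is what the Discriminator contributes: it tells the Generator how to change its output to raise D's score. With a 64*64 output, that factor simply becomes a 64*64 array of partial derivatives, and only the Generator's weights are updated with the result.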
The code is in PyTorch:
import torch
import torch.nn as nn

bce_loss = nn.BCELoss() #bce_loss = -ylog(y_hat)-(1-y)log(1-y_hat)[similar to L_adversarial]
Discriminator = ..... #some network
Generator = ..... #some network
optimizer_generator = ....... #some optimizer for generator network
optimizer_discriminator = ....... #some optimizer for discriminator network
z = ...... #some latent data distribution that is transformed by the generator
real = ..... #real data distribution
#####################
#Update Discriminator
#####################
optimizer_discriminator.zero_grad()  # clear gradients left from the previous step
fake = Generator(z)
fake_prediction = Discriminator(fake.detach())  # detach: this step should not backpropagate into the Generator
real_prediction = Discriminator(real)
discriminator_loss = bce_loss(fake_prediction,torch.zeros(batch_size))+bce_loss(real_prediction,torch.ones(batch_size))
discriminator_loss.backward()
optimizer_discriminator.step()
#################
#Update Generator
#################
optimizer_generator.zero_grad()  # clear the Generator's old gradients
fake = Generator(z)
fake_prediction = Discriminator(fake)  # no detach here: gradients flow through the Discriminator into the Generator
generator_loss = bce_loss(fake_prediction,torch.ones(batch_size))
generator_loss.backward()
optimizer_generator.step()  # only the Generator's parameters are updated
