I am training a model in Keras with as follows:
model.fit(Xtrn, ytrn batch_size=16, epochs=50, verbose=1, shuffle=True,
callbacks=[model_checkpoint], validation_data=(Xval, yval))
The fitting output looks as follows:
As shown in the model.fit I have a batch size of 16 and a total of 8000 training samples as shown in the output. So from my understanding, training takes place every 16 batches. Which also means training is ran 500 times for a single epoch (i.e., 8000/16 =500)
So let's take the training accuracy printed in the output for Epoch 1/50, which in this case is 0.9381. I would like to know how is this training accuracy of 0.9381 derived.
Is it the:
Is the mean training accuracy, taken as the average from the 500 times training, performed for every batch?
OR,
Is it the best (or max) training accuracy from out of the 500 instances the training procedure is run?
Take a look at the BaseLogger in Keras where they're computing a running mean.
For each epoch the accuracy is the average of all the batches seen before in that epoch.
class BaseLogger(Callback):
"""Callback that accumulates epoch averages of metrics.
This callback is automatically applied to every Keras model.
"""
def on_epoch_begin(self, epoch, logs=None):
self.seen = 0
self.totals = {}
def on_batch_end(self, batch, logs=None):
logs = logs or {}
batch_size = logs.get('size', 0)
self.seen += batch_size
for k, v in logs.items():
if k in self.totals:
self.totals[k] += v * batch_size
else:
self.totals[k] = v * batch_size
def on_epoch_end(self, epoch, logs=None):
if logs is not None:
for k in self.params['metrics']:
if k in self.totals:
# Make value available to next callbacks.
logs[k] = self.totals[k] / self.seen
Related
I am training a conditional GAN that generates image time series (similar to video prediction). I built a conditional GAN based on this paper. However, several probelms happened when I was training the cGAN.
Problems of training cGAN:
The discriminator's loss stucks at one.
It seems like the generator's loss is not effected by discriminator no matter how I adjust the hyper parameters related to the discriminator.
Training loss of discriminator
D_loss = (fake_D_loss + true_D_loss) / 2
fake_D_loss = Hinge_loss(D(G(x, z)))
true_D_loss = Hinge_loss(D(x, y))
The margin of hinge loss = 1
Training loss of generator
D_loss = -torch.mean(D(G(x,z))
G_loss = weighted MAE
Gradient flow of discriminator
Gradient flow of generator
Several settings of the cGAN:
The output layer of discriminator is linear sum.
The discriminator is trained twice per epoch while the generator is only trained once.
The number of neurons of the generator and discriminator are exactly the same as the paper.
I replaced the ReLU (original setting) to LeakyReLU to avoid nan.
I added gradient norm to avoid gradient vanishing problem.
Other hyper parameters are listed as follows:
Hyper parameters
Paper
Mine
number of input images
4
4
number of predicted images
18
10
batch size
16
16
opt_g, opt_d
Adam
Adam
lr_g
5e-5
5e-5
lr_d
2e-4
2e-4
The loss function I use for discriminator.
def HingeLoss(pred, validity, margin=1.):
if validity:
loss = F.relu(margin - pred)
else:
loss = F.relu(margin + pred)
return loss.mean()
The loss function for examining the validity of predicted image from generator.
def HingeLossG(pred):
return -torch.mean(pred)
I use the trainer of pytorch_lightning to train the model. The training codes I wrote are as follows.
def training_step(self, batch, batch_idx, optimizer_idx):
x, y = batch
x.requires_grad = True
if self.n_sample > 1:
pred = [self(x) for _ in range(self.n_sample)]
pred = torch.mean(torch.stack(pred, dim=0), dim=0)
else:
pred = self(x)
##### TRAIN DISCRIMINATOR #####
if optimizer_idx == 1:
true_D_loss = self.discriminator_loss(self.discriminator(x, y), True)
fake_D_loss = self.discriminator_loss(self.discriminator(x, pred.detach()), False)
D_loss = (fake_D_loss + true_D_loss) / 2
return D_loss
##### TRAIN GENERATOR #####
if optimizer_idx == 0:
G_loss = self.generator_loss(pred, y)
GD_loss = self.generator_d_loss(self.discriminator(x, pred.detach()))
train_G_loss = G_loss + GD_loss
return train_G_loss
I have several guesses of why these problems may occur:
Since the original model predicts 18 frames rather than 10 frames (my version), maybe the number of neurons in the original generator is too much for my case (predicting 10 frames), leading an exceedingly powerful generator that breaks the balance of training. However, I've tried to lower the learning rate of generator to 1e-5 (original 5e-5) or increase the training times of discriminator to 3 to 5 times. It seems that the loss curve of generator didn't much changed.
Various results of training cGAN
I have also adjust the weights of generator's loss, but the same problems still occurred.
The architecture codes of this model: https://github.com/hyungting/DGMR-pytorch
I am trying to solve a regression problem using pytorch. I have a pre-trained model to start with. When I was tuning hyperparameters, I found my batch size and train/validation loss have a weird correlation. Specifically:
batch size = 16 -\> train/val loss around 0.6 (for epoch 1)
batch size = 64 -\> train/val loss around 0.8 (for epoch 1)
batch size = 128 -\> train/val loss around 1 (for epoch 1)
I want to know if this is normal, or there is something wrong with my code.
optimizer: SGD with learning rate of 1e-3
Loss function:
def rmse(pred, real):
residuals = pred - real
square = torch.square(residuals)
sum_of_square = torch.sum(square)
mean = sum_of_square / pred.shape[0]
root = torch.sqrt(mean)
return root
train loop:
def train_loop(dataloader, model, optimizer, epoch):
num_of_batches = len(dataloader)
total_loss = 0
for batch, (X, y) in enumerate(dataloader):
optimizer.zero_grad()
pred = model(X)
loss = rmse(pred, y)
loss.backward()
optimizer.step()
total_loss += loss.item()
#lr_scheduler.step(epoch*num_of_batches+batch)
#last_lr = lr_scheduler.get_last_lr()[0]
train_loss = total_loss / num_of_batches
return train_loss
test loop:
def test_loop(dataloader, model):
size = len(dataloader.dataset)
num_of_batches = len(dataloader)
test_loss = 0
with torch.no_grad():
for X, y in dataloader:
pred = model(X)
test_loss += rmse(pred, y).item()
test_loss /= num_of_batches
return test_loss
I'll start with an a. analogy, b. dive into the math, and then c. end with a numerical experiment.
a.) What you are witnessing is roughly the same phenomenon as the difference between stochastic and batched gradient descent. In the analog case, the "true" gradient or direction in which the learned parameters should be shifted minimizes the loss over the entire training set of data. In stochastic gradient descent, the gradient shifts the learned parameters in the direction that minimizes the loss for a single example. As the size of the batch is increased from 1 towards the size of the overall dataset, the gradient estimated from the minibatch becomes closer to the gradient for the whole dataset.
Now, is stochastic gradient descent useful at all, given that it is imprecise wrt the whole dataset? Absolutely. In fact, the noise in this estimate can be useful for escaping local minima in the optimization. Analogously, any noise in your estimate of loss wrt the whole dataset is likely nothing to worry about.
b.) But let's next look at why this behavior occurs. RMSE is defined as:
where N is the total number of examples in your dataset. And if RMSE were calculated this way, we would expect the value to be roughly the same (and to approach exactly the same value as N becomes large). However, in your case, you are actually calculating the mean epoch loss as:
where B is the number of minibatches per epoch, and b is the number of examples per minibatch:
Thus, epoch loss is the average RMSE per minibatch. Rearranging, we can see:
when B is large (B = N) and the minibatch size is 1,
which clearly has quite different properties than RMSE defined above. However, as B becomes small B = 1, and minibatch size is N,
which is exactly equal to RMSE above. So as you increase the batch size, the expected value for the quantity you compute moves between these two expressions. This explains the (roughly square root) scaling of your loss with different minibatch sizes. Epoch loss is an estimate of RMSE (which can be thought of as the standard deviation of model prediction error). One training goal could be to drive this error standard deviation to zero, but your expression for epoch loss is also likely a good proxy for this. And both quantities are themselves proxies for whatever model performance you actually hope to obtain.
c. You can try this for yourself with a trivial toy problem. A normal distribution is used as a proxy for model error.
EXAMPLE 1: Compute RMSE for whole dataset ( of size 10000 x b)
import torch
for b in [1,2,3,5,9,10,100,1000,10000,100000]:
b_errors = []
for i in range (10000):
error = torch.normal(0,100,size = (1,b))
error = error **2
error = error.mean()
b_errors.append(error)
RMSE = torch.sqrt(sum(b_errors)/len(b_errors))
print("Average RMSE for b = {}: {}".format(N,RMSE))
Result:
Average RMSE for b = 1: 99.94982147216797
Average RMSE for b = 2: 100.38357543945312
Average RMSE for b = 3: 100.24600982666016
Average RMSE for b = 5: 100.97154998779297
Average RMSE for b = 9: 100.06820678710938
Average RMSE for b = 10: 100.12358856201172
Average RMSE for b = 100: 99.94219970703125
Average RMSE for b = 1000: 99.97941589355469
Average RMSE for b = 10000: 100.00338745117188
EXAMPLE 2: Compute Epoch Loss with B = 10000
import torch
for b in [1,2,3,5,9,10,100,1000,10000,100000]:
b_errors = []
for i in range (10000):
error = torch.normal(0,100,size = (1,b))
error = error **2
error = error.mean()
error = torch.sqrt(error)
b_errors.append(error)
avg = (sum(b_errors)/len(b_errors)
print("Average Epoch Loss for b = {}: {}".format(b,avg))
Result:
Average Epoch Loss for b = 1: 80.95650482177734
Average Epoch Loss for b = 2: 88.734375
Average Epoch Loss for b = 3: 92.08515930175781
Average Epoch Loss for b = 5: 95.56260681152344
Average Epoch Loss for b = 9: 97.49445343017578
Average Epoch Loss for b = 10: 97.20250701904297
Average Epoch Loss for b = 100: 99.6297607421875
Average Epoch Loss for b = 1000: 99.96969604492188
Average Epoch Loss for b = 10000: 99.99618530273438
Average Epoch Loss for b = 100000: 100.00079345703125
The first batch of the first epoch is always going to be pretty inconsistent between runs unless you setup a manual rng seed. Your loss is a result of how well your randomly initialized weights do with your randomly subsampled batch of training items. In other words, its meaningless (in this context) what your loss is on this first go-around, regardless of batch-size.
what is the standard way to detect if a model has converged? I was going to record 5 losses with 95 confidence intervals each loss and if they all agreed then I’d halt the script. I assume training until convergence must be implemented already in PyTorch or PyTorch Lightning somewhere. I don’t need a perfect solution, just the standard way to do this automatically - i.e. halt when converged.
My solution is easy to implement. Once create a criterion and changes the reduction to none. Then it will output a tensor of size [B]. Every you log you record that and it's 95 confidence interval (or std if you prefer, but that is much less accuracy). Then every time you add a new loss with it's confidence interval make sure it remains of size 5 (or 10) and that the 5 losses are within a 95 CI of each other. Then if that is true halt.
You can compute the CI with this:
def torch_compute_confidence_interval(data: Tensor,
confidence: float = 0.95
) -> Tensor:
"""
Computes the confidence interval for a given survey of a data set.
"""
n = len(data)
mean: Tensor = data.mean()
# se: Tensor = scipy.stats.sem(data) # compute standard error
# se, mean: Tensor = torch.std_mean(data, unbiased=True) # compute standard error
se: Tensor = data.std(unbiased=True) / (n**0.5)
t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
ci = t_p * se
return mean, ci
and you can create the criterion as follow:
loss: nn.Module = nn.CrossEntropyLoss(reduction='none')
so the train loss is now of size [B].
note that I know how to train with a fixed number of epochs, so I am not really looking for that - just the halting criterion for when to stop when models looks converged, what a person would sort of do when they look at their learning curve but automatically.
ref:
https://forums.pytorchlightning.ai/t/what-is-the-standard-way-to-halt-a-script-when-it-has-converged/1415
Set an EarlyStopping (https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.EarlyStopping.html#pytorch_lightning.callbacks.EarlyStopping) callback in your trainer by
checkpoint_callbacks = [
EarlyStopping(
monitor="val_f1_score",
min_delta=0.01,
patience=10, # NOTE no. val epochs, not train epochs
verbose=False,
mode="min",
),
]
trainer = pl.Trainer(callbacks=callbacks)
This will monitor changes in val_f1_score during training (notice that you have to log this value with self.log("val_f1_score", val_f1) in your pl.LightningModule). And it will stop the training if the minimum change to quantity to qualify as an improvement (min_delta) for more than the number of epoch specified as patience
I am new to PyTorch and I'm trying to build a simple neural net for classification. The problem is the network doesn't learn at all. I tried various learning rate ranging from 0.3 to 1e-8 and I also tried training it for a longer duration. My data is small with only 120 training examples and the batch size is 16. Here is the code
Define network
model = nn.Sequential(nn.Linear(4999, 1000),
nn.ReLU(),
nn.Linear(1000,200),
nn.ReLU(),
nn.Linear(200,1),
nn.Sigmoid())
Loss and optimizer
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.BCELoss(reduction="mean")
Training
num_epochs = 100
for epoch in range(num_epochs):
cumulative_loss = 0
for i, data in enumerate(batch_gen(X_train, y_train, batch_size=16)):
inputs, labels = data
inputs = torch.from_numpy(inputs).float()
labels = torch.from_numpy(labels).float()
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
cumulative_loss += loss.item()
if i%5 == 0 and i != 0:
print(f"epoch {epoch} batch {i} => Loss: {cumulative_loss/5}")
print("Finished Training!!")
Any help is appreciated!
The reason your loss doesn't seem to decrease every epoch is because you're not printing it every epoch. You're actually printing it every 5th batch. And the loss does not decrease a lot per batch.
Try the following. Here, loss every epoch will be printed.
num_epochs = 100
for epoch in range(num_epochs):
cumulative_loss = 0
for i, data in enumerate(batch_gen(X_train, y_train, batch_size=16)):
inputs, labels = data
inputs = torch.from_numpy(inputs).float()
labels = torch.from_numpy(labels).float()
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
cumulative_loss += loss.item()
print(f"epoch {epoch} => Loss: {cumulative_loss}")
print("Finished Training!!")
One reason that your loss doesn't decrease could be because your neural-net isn't deep enough to learn anything. So, trying add more layers.
model = nn.Sequential(nn.Linear(4999, 3000),
nn.ReLU(),
nn.Linear(3000,200),
nn.ReLU(),
nn.Linear(2000,1000),
nn.ReLU(),
nn.Linear(500,250),
nn.ReLU(),
nn.Linear(250,1),
nn.Sigmoid())
Also, I just noticed you're passing data that has very high dimensionality. You have 4999 features/columns and only 120 training examples/rows. Converging a model with so less data is next to impossible (considering you have very high dimensional data).
I'd suggest you try finding more rows or perform dimensionality reduction on your input data (like PCA) to reduce the feature space (to maybe 50/100 or lesser features) and then try again. Chances are that your model still won't converge but it's worth a try.
I restored a pre-trained model in Tensorflow 1.2 to do some testing work. I assumed the model was well-trained since the loss decreased to very low (0.0001). However, with either the testing samples or training samples, the accuracy ops give me a value which is almost 0. Is this because I'm using the wrong accuracy function or is it because the model is the problem?
Here is the accuracy function, the test_image below is a batch with a single test sample, test_image_label is a single label:
correct_prediction = tf.equal(tf.argmax(GoogleNet(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)
with Session() as less:
accuracy_vector = []
for num in range(len(testnames)):
accuracy_vector.append(sess.run(accuracy, feed_dict={keep_prob: 1.0}))
print(accuracy_vector)
mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
print("test accuracy %g"%mean_accuracy)
The model is defined as GoogleNet(data) above, it is a function that returns the logits of the input batch. The training was done like this:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=GoogleNet(train_batch)))
train_step = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost, global_step=global_step)
The train_step is ran in every iteration. I think it is worth noting that after restored the model, I cannot run print(GoogleNet(test_image).eval(feed_dict={keep_prob: 1.0})) in the session, with which I intended to take a look at the output of the model. It returns the error of FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Variable_213
[[Node: Variable_213/read = Identity[T=DT_FLOAT, _class=["loc:#Variable_213"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_213)]]