sampled_softmax_loss not decreasing while sequence_loss_by_example does - machine-learning

I am training an LSTM and using sampled_softmax_loss to compute the training loss after each epoch (i.e. over many documents). I also compute the perplexity on a held-out set at the same time with sequence_loss_by_example.
The loss decreases for the first few epochs, dropping sharply during epochs 1-2, and then just hovers around the same value (sometimes a little lower, sometimes a little higher). The perplexity, on the other hand, does decrease consistently.
Why would loss stop decreasing while perplexity continues to go down? I expected both of them to decrease consistently.
Code looks something like this:
total_steps = 0
total_cost = 0.
for batch in train_epoch:
    total_steps += num_steps
    loss = tf.nn.sampled_softmax_loss(...)
    cost = tf.reduce_sum(loss) / batch_size
    total_cost += cost
    ...
    optimizer.apply_gradients(tf.gradients(cost, vars), ...)
print("average loss = {}".format(total_cost / total_steps))

total_steps = 0
total_xentropy = 0.
for batch in valid_epoch:
    total_steps += num_steps
    loss = tf.nn.seq2seq.sequence_loss_by_example(...)
    total_xentropy += tf.reduce_sum(loss) / batch_size
print("perplexity = {}".format(np.exp(total_xentropy / total_steps)))

This observed behavior was resolved by reducing the learning rate. After that change, training loss and validation perplexity moved (mostly) in tandem.
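For reference, one way to reduce the learning rate in this kind of TF1-style setup is to decay it over time instead of keeping it fixed. A minimal sketch, assuming cost and vars are the tensors from the pseudocode above; the starting rate and decay schedule are purely illustrative:

import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(
    learning_rate=1.0,      # illustrative starting rate
    global_step=global_step,
    decay_steps=10000,      # illustrative decay interval
    decay_rate=0.5)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(tf.gradients(cost, vars), vars),
                                     global_step=global_step)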

Related

Larger batch size causes larger loss

I am trying to solve a regression problem using pytorch. I have a pre-trained model to start with. When I was tuning hyperparameters, I found my batch size and train/validation loss have a weird correlation. Specifically:
batch size = 16 -> train/val loss around 0.6 (for epoch 1)
batch size = 64 -> train/val loss around 0.8 (for epoch 1)
batch size = 128 -> train/val loss around 1 (for epoch 1)
I want to know if this is normal, or there is something wrong with my code.
optimizer: SGD with learning rate of 1e-3
Loss function:
def rmse(pred, real):
    residuals = pred - real
    square = torch.square(residuals)
    sum_of_square = torch.sum(square)
    mean = sum_of_square / pred.shape[0]
    root = torch.sqrt(mean)
    return root
train loop:
def train_loop(dataloader, model, optimizer, epoch):
    num_of_batches = len(dataloader)
    total_loss = 0
    for batch, (X, y) in enumerate(dataloader):
        optimizer.zero_grad()
        pred = model(X)
        loss = rmse(pred, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # lr_scheduler.step(epoch*num_of_batches+batch)
        # last_lr = lr_scheduler.get_last_lr()[0]
    train_loss = total_loss / num_of_batches
    return train_loss
test loop:
def test_loop(dataloader, model):
    size = len(dataloader.dataset)
    num_of_batches = len(dataloader)
    test_loss = 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += rmse(pred, y).item()
    test_loss /= num_of_batches
    return test_loss
I'll start with (a) an analogy, (b) dive into the math, and (c) end with a numerical experiment.
a.) What you are witnessing is roughly the same phenomenon as the difference between stochastic and batched gradient descent. In the full-batch case, the "true" gradient, i.e. the direction in which the learned parameters should be shifted, minimizes the loss over the entire training set. In stochastic gradient descent, the gradient shifts the learned parameters in the direction that minimizes the loss for a single example. As the batch size increases from 1 towards the size of the overall dataset, the gradient estimated from the minibatch becomes closer to the gradient for the whole dataset.
Now, is stochastic gradient descent useful at all, given that it is imprecise wrt the whole dataset? Absolutely. In fact, the noise in this estimate can be useful for escaping local minima in the optimization. Analogously, any noise in your estimate of loss wrt the whole dataset is likely nothing to worry about.
b.) But let's next look at why this behavior occurs. RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^2}$$

where N is the total number of examples in your dataset and $e_i$ is the prediction error on example $i$. If RMSE were calculated this way, we would expect the value to be roughly the same for any batch size (and to approach exactly the same value as N becomes large). However, in your case, you are actually calculating the mean epoch loss as

$$\text{epoch loss} = \frac{1}{B}\sum_{j=1}^{B}\sqrt{\frac{1}{b}\sum_{i=1}^{b} e_{j,i}^2}$$

where B is the number of minibatches per epoch, b is the number of examples per minibatch, and $e_{j,i}$ is the error on the i-th example of minibatch j. Thus, epoch loss is the average RMSE per minibatch. Rearranging, we can see that when B is large (B = N) and the minibatch size is 1,

$$\text{epoch loss} = \frac{1}{N}\sum_{i=1}^{N} |e_i|$$

which is the mean absolute error and clearly has quite different properties than the RMSE defined above. However, as B becomes small (B = 1) and the minibatch size is N,

$$\text{epoch loss} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^2}$$

which is exactly equal to the RMSE above. So as you increase the batch size, the expected value for the quantity you compute moves between these two expressions. This explains the (roughly square-root) scaling of your loss with different minibatch sizes. Epoch loss is an estimate of RMSE (which can be thought of as the standard deviation of model prediction error). One training goal could be to drive this error standard deviation to zero, but your expression for epoch loss is also likely a good proxy for this. And both quantities are themselves proxies for whatever model performance you actually hope to obtain.
c.) You can try this for yourself with a trivial toy problem, where a normal distribution is used as a proxy for model error.
EXAMPLE 1: Compute RMSE for the whole dataset (of size 10000 x b)
import torch

for b in [1, 2, 3, 5, 9, 10, 100, 1000, 10000, 100000]:
    b_errors = []
    for i in range(10000):
        error = torch.normal(0, 100, size=(1, b))
        error = error ** 2
        error = error.mean()
        b_errors.append(error)
    RMSE = torch.sqrt(sum(b_errors) / len(b_errors))
    print("Average RMSE for b = {}: {}".format(b, RMSE))
Result:
Average RMSE for b = 1: 99.94982147216797
Average RMSE for b = 2: 100.38357543945312
Average RMSE for b = 3: 100.24600982666016
Average RMSE for b = 5: 100.97154998779297
Average RMSE for b = 9: 100.06820678710938
Average RMSE for b = 10: 100.12358856201172
Average RMSE for b = 100: 99.94219970703125
Average RMSE for b = 1000: 99.97941589355469
Average RMSE for b = 10000: 100.00338745117188
EXAMPLE 2: Compute Epoch Loss with B = 10000
import torch

for b in [1, 2, 3, 5, 9, 10, 100, 1000, 10000, 100000]:
    b_errors = []
    for i in range(10000):
        error = torch.normal(0, 100, size=(1, b))
        error = error ** 2
        error = error.mean()
        error = torch.sqrt(error)
        b_errors.append(error)
    avg = sum(b_errors) / len(b_errors)
    print("Average Epoch Loss for b = {}: {}".format(b, avg))
Result:
Average Epoch Loss for b = 1: 80.95650482177734
Average Epoch Loss for b = 2: 88.734375
Average Epoch Loss for b = 3: 92.08515930175781
Average Epoch Loss for b = 5: 95.56260681152344
Average Epoch Loss for b = 9: 97.49445343017578
Average Epoch Loss for b = 10: 97.20250701904297
Average Epoch Loss for b = 100: 99.6297607421875
Average Epoch Loss for b = 1000: 99.96969604492188
Average Epoch Loss for b = 10000: 99.99618530273438
Average Epoch Loss for b = 100000: 100.00079345703125
The first batch of the first epoch is always going to be pretty inconsistent between runs unless you set up a manual RNG seed. Your loss is a result of how well your randomly initialized weights do on your randomly subsampled batch of training items. In other words, it's meaningless (in this context) what your loss is on this first go-around, regardless of batch size.
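If you do want first-epoch numbers to be comparable across runs, a minimal seeding sketch using the standard Python/NumPy/PyTorch calls (the seed value itself is arbitrary):

import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch RNG (CPU, and CUDA in recent versions)
    torch.cuda.manual_seed_all(seed)  # explicitly seed all GPUs; no-op without CUDA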

Nan as training loss in a CNN

I'm trying to train a small CNN from scratch to classify images of 10 different animal species. The images have different dimensions, but I'd say around 300x300. Anyway, every image is resized to 224x224 before going into the model.
Here is the network I'm training:
# Convolution 1
self.cnn1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=0)
self.relu1 = nn.ReLU()
# Max pool 1
self.maxpool1 = nn.MaxPool2d(kernel_size=2)
# Convolution 2
self.cnn2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=0)
self.relu2 = nn.ReLU()
# Max pool 2
self.maxpool2 = nn.MaxPool2d(kernel_size=2)
# Fully connected 1
self.fc1 = nn.Linear(32 * 54 * 54, 10)
I'm using an SGD optimizer with a fixed learning rate of 0.005 and weight decay of 0.01, and a cross-entropy loss function.
The accuracy of the model is good (around 99% after the 43rd epoch). However:
in some epochs I get 'nan' as the training loss
in some other epochs the accuracy drops significantly (sometimes the two happen in the same epoch); in the next epoch, however, the accuracy comes back to a normal level.
If I understand correctly, a NaN training loss is most often caused by gradient values getting too small (underflow) or too large (overflow). Could this be the case?
Should I try increasing the weight decay to 0.05? Or should I apply gradient clipping to avoid exploding gradients? If so, what would be a reasonable bound?
I still don't understand the second issue.
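For reference, gradient clipping in PyTorch is applied between loss.backward() and optimizer.step(). A minimal sketch reusing the model, criterion, and optimizer described in the question; the max_norm value is an illustrative assumption, not a recommendation from this thread:

import torch

for images, labels in train_loader:  # train_loader as in your training setup
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    # Rescale gradients so their global norm does not exceed max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()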

BCEWithLogitsLoss Multi-label Classification

I'm a bit confused about how to accumulate the batch losses to obtain the epoch loss.
Two questions:
Is #1 (see comments below) the correct way to calculate loss with masks?
Is #2 the correct way to report epoch loss?
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

for epoch in range(10):
    EPOCH_LOSS = 0.
    for inputs, gt_labels, masks in training_dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        # 1: Is this the correct way to calculate batch loss? Do I multiply batch_loss
        #    by outputs.shape[0] before adding it to EPOCH_LOSS?
        batch_loss = (masks * criterion(outputs, gt_labels.float())).mean()
        EPOCH_LOSS += batch_loss
        batch_loss.backward()
        optimizer.step()
    # 2: Then what do I do here? Do I divide EPOCH_LOSS by len(training_dataloader)?
    print(f'EPOCH LOSS: {EPOCH_LOSS / len(training_dataloader):.3f}')
In your criterion you have the default reduction set ('mean', see the docs), so the loss is already a scalar and your masking approach won't work. You should apply your masking one step earlier (prior to the loss calculation), like so:
batch_loss = (criterion(outputs*masks, gt_labels.float()*masks)).mean()
OR
batch_loss = (criterion(outputs[masks], gt_labels.float()[masks])).mean()
But without seeing your data, it is hard to say whether it is in a different format; you might want to check that this is working as expected.
As for your actual question, it depends on how you want to represent your data. What I would do is simply sum all of the batches' losses and report that, but you can choose to divide by the number of batches if you want to report the AVERAGE loss per batch in the epoch.
Because this value is purely illustrative of your model's progress, it doesn't actually matter which one you pick, as long as it's consistent between epochs so that it still shows whether your model is learning.
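An alternative to masking the inputs/targets, shown here as a hedged sketch rather than as part of the original answer, is to construct the criterion with reduction='none' so it returns a per-element loss, then mask and average explicitly:

import torch

criterion = torch.nn.BCEWithLogitsLoss(reduction='none')  # pos_weight can still be passed here

def masked_bce_loss(outputs, gt_labels, masks):
    per_element = criterion(outputs, gt_labels.float())  # same shape as outputs
    masked = per_element * masks                          # zero out ignored entries
    return masked.sum() / masks.sum().clamp(min=1)        # average over kept entries only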

What is the standard way to train a PyTorch script until convergence?

What is the standard way to detect whether a model has converged? I was going to record the last 5 losses, each with a 95% confidence interval, and halt the script if they all agreed. I assume training until convergence must already be implemented somewhere in PyTorch or PyTorch Lightning. I don't need a perfect solution, just the standard way to do this automatically - i.e. halt when converged.
My solution is easy to implement: create a criterion with the reduction set to 'none', so that it outputs a tensor of size [B]. Every time you log, record that loss and its 95% confidence interval (or the std if you prefer, but that is much less accurate). Then, each time you add a new loss and its confidence interval, keep only the last 5 (or 10) and check whether those 5 losses are within a 95% CI of each other. If that is true, halt.
You can compute the CI with this:
from typing import Tuple

import scipy.stats
import torch
from torch import Tensor

def torch_compute_confidence_interval(data: Tensor,
                                      confidence: float = 0.95,
                                      ) -> Tuple[Tensor, Tensor]:
    """
    Computes the confidence interval for a given survey of a data set.
    """
    n = len(data)
    mean: Tensor = data.mean()
    # se: Tensor = scipy.stats.sem(data)  # compute standard error
    # se, mean: Tensor = torch.std_mean(data, unbiased=True)  # compute standard error
    se: Tensor = data.std(unbiased=True) / (n ** 0.5)
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci
and you can create the criterion as follows:
loss: nn.Module = nn.CrossEntropyLoss(reduction='none')
so the train loss is now of size [B].
Note that I know how to train for a fixed number of epochs, so I am not really looking for that - just a halting criterion for when to stop once the model looks converged, i.e. what a person would do by eye when looking at the learning curve, but done automatically.
ref:
https://forums.pytorchlightning.ai/t/what-is-the-standard-way-to-halt-a-script-when-it-has-converged/1415
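A minimal sketch of the halting rule described above (the window size and the pairwise-overlap test are one reading of the idea, using the torch_compute_confidence_interval helper defined earlier):

from collections import deque

window = deque(maxlen=5)  # the last 5 (mean, ci) pairs

def converged(window) -> bool:
    """True when every recorded mean loss lies within every other loss's 95% CI."""
    if len(window) < window.maxlen:
        return False
    return all(abs(m1 - m2) <= ci2 for m1, _ in window for m2, ci2 in window)

# at each logging step:
# mean, ci = torch_compute_confidence_interval(per_example_losses)
# window.append((float(mean), float(ci)))
# if converged(window): halt training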
Set an EarlyStopping (https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.EarlyStopping.html#pytorch_lightning.callbacks.EarlyStopping) callback in your trainer:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

callbacks = [
    EarlyStopping(
        monitor="val_f1_score",
        min_delta=0.01,
        patience=10,   # NOTE: number of validation epochs, not training epochs
        verbose=False,
        mode="max",    # "max" because a higher F1 score counts as an improvement
    ),
]
trainer = pl.Trainer(callbacks=callbacks)
This will monitor changes in val_f1_score during training (note that you have to log this value with self.log("val_f1_score", val_f1) in your pl.LightningModule). It will stop training if the monitored quantity fails to improve by at least min_delta for more than the number of validation epochs specified as patience.
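A minimal sketch of where that logging call lives (only the validation_step is shown; the metric computation is a hypothetical helper):

import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    # forward(), training_step(), configure_optimizers(), etc. omitted

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        val_f1 = compute_f1(logits, y)    # hypothetical metric helper
        self.log("val_f1_score", val_f1)  # the key that EarlyStopping monitors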

Neural Network Trains Fine and Test Predictions are Horrible Bordering on Ridiculous

I am having a lot of trouble with a neural network model built with the R neuralnet() function. When I train a network on all of the data, the predictions are very accurate, as expected. However, when I split the data into training and test sets, the test predictions are terrible. I cannot figure out what I am doing wrong. I would appreciate any advice or help troubleshooting, as I don't think I'll be able to figure this out on my own. Thanks in advance.
I have included the R code, some plots, and an example of the data below; the full data set is 3600 observations.
Best Regards-Pat
UPDATE 05/12/18: BASED ON FEEDBACK THAT THIS LOOKS LIKE OVERFITTING, I TRIED STOPPING THE TRAINING EARLIER AND FOUND THAT THE MSE OF THE TEST PREDICTION NEVER GETS VERY LOW AND IS LOWEST APPROACHING 0 TRAINING EPOCHS AND RISES FROM THERE (PLOT INCLUDED AND CODE APPENDED)
###########
#ANN Models
###########
#Load libraries
library(plyr)
library(ggplot2)
library(gridExtra)
library(neuralnet)
#Retain only numerically coded data from data1 in (data2) for ANN fitting
data2 = data1[,c(3:7)]
#Calculate Min and Max for Scaling
max_data = apply(data2,2,max)
min_data = apply(data2,2,min)
#Scale data 0-1
data2_scaled = scale(data2,center=min_data,scale=max_data-min_data)
#Check data structure
data2_scaled
#Fit neural net model
model_nn1 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=data2_scaled,hidden=c(8,8),stepmax=1000000,threshold=0.01)
#Calculate Min and Max Response for rescaling
max_time = max(data2$time)
min_time = min(data2$time)
#Rescale neural net response predictions
pred_nn1 = model_nn1$net.result[[1]][,1]*(max_time-min_time)+min_time
#Compare model predictions to actual values
a03 = cbind.data.frame(data1$time,pred_nn1,data1$machine,data1$app)
colnames(a03) = c("actual","prediction","machine","app")
attach(a03)
p01 = ggplot(a03,aes(x=actual,y=prediction))+
geom_point(aes(color=machine),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (ALL DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
p02 = ggplot(a03,aes(x=actual,y=prediction))+
geom_point(aes(color=app),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (ALL DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
grid.arrange(p01,p02,nrow=1)
#Visualize ANN
plot(model_nn1)
#Epochs taken to train "steps"
model_nn1$result.matrix[3,]
#########################
#Testing and Training ANN
#########################
#Split the data into a test and training set
index = sample(1:nrow(data2_scaled),round(0.80*nrow(data2_scaled)))
train_data = as.data.frame(data2_scaled[index,])
test_data = as.data.frame(data2_scaled[-index,])
model_nn2 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=train_data,hidden=c(3,2),stepmax=1000000,threshold=0.01)
pred_nn2_scaled = compute(model_nn2,test_data[,c(1,2,4,5)])
pred_nn2 = pred_nn2_scaled$net.result*(max_time-min_time)+min_time
test_data_time = test_data$time*(max_time-min_time)+min_time
a04 = cbind.data.frame(test_data_time,pred_nn2,data1[-index,2],data1[-index,1])
colnames(a04) = c("actual","prediction","machine","app")
attach(a04)
p01 = ggplot(a04,aes(x=actual,y=prediction))+
geom_point(aes(color=machine),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (TEST DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
p02 = ggplot(a04,aes(x=actual,y=prediction))+
geom_point(aes(color=app),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (TEST DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
grid.arrange(p01,p02,nrow=1)
#EARLY STOPPING TEST
i = 1000
summary_data = data.frame(matrix(rep(0,4*i),ncol=4))
colnames(summary_data) = c("threshold","epochs","train_mse","test_mse")
for (j in 1:i){
  a = runif(1,min=0.01,max=10)
  #Train the model
  model_nn2 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=train_data,hidden=3,stepmax=1000000,threshold=a)
  #Calculate Min and Max Response for rescaling
  max_time = max(data2$time)
  min_time = min(data2$time)
  #Predict test data from trained nn
  pred_nn2_scaled = compute(model_nn2,test_data[,c(1,2,4,5)])
  #Rescale test prediction
  pred_test_data_time = pred_nn2_scaled$net.result*(max_time-min_time)+min_time
  #Rescale test actual
  test_data_time = test_data$time*(max_time-min_time)+min_time
  #Rescale train prediction
  pred_train_data_time = model_nn2$net.result[[1]][,1]*(max_time-min_time)+min_time
  #Rescale train actual
  train_data_time = train_data$time*(max_time-min_time)+min_time
  #Calculate mse
  test_mse = mean((pred_test_data_time-test_data_time)^2)
  train_mse = mean((pred_train_data_time-train_data_time)^2)
  #Summarize
  summary_data[j,1] = a
  summary_data[j,2] = model_nn2$result.matrix[3,]
  summary_data[j,3] = round(train_mse,3)
  summary_data[j,4] = round(test_mse,3)
  print(summary_data[j,])
}
plot(summary_data$epochs,summary_data$test_mse,pch=19,xlim=c(0,2000),ylim=c(0,300000),cex=0.5,xlab="Training Steps",ylab="MSE",main="Early Stopping Test: Comparing MSE : TEST=BLACK TRAIN=RED")
points(summary_data$epochs,summary_data$train_mse,pch=19,col=2,cex=0.5)
I would guess that it is overfitting. The network is learning to reproduce the data like a dictionary instead of learning the underlying function in the data. There are various things which can cause this and ways to address them.
Things which cause overfitting are:
The network could be training for too long.
The network could have far more weights than training examples.
Ways to reduce overfitting are:
Create a validation dataset and stop training the network as soon as the validation set's loss starts increasing (see the sketch after this answer). This is a necessity.
Reduce the network size (fewer weights).
Use a regularization technique like weight decay or dropout.
Also, it may be possible that the problem is too difficult for a neural network to solve based on the data it is given. Reproducing the training data does not prove that the network can solve the problem; it only proves that the network can remember things like a dictionary.
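The thread's code is in R, but the validation-based early stopping in the list above is framework-agnostic. A minimal Python sketch, where train_one_epoch and validation_loss are hypothetical helpers standing in for your own training and evaluation code:

import copy

def fit_with_early_stopping(model, train_data, val_data, max_epochs=1000, patience=10):
    best_val_loss = float("inf")
    best_model = None
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)           # hypothetical: one pass over the training set
        val_loss = validation_loss(model, val_data)  # hypothetical: loss on held-out data
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model = copy.deepcopy(model)         # remember the best model so far
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:         # validation loss stopped improving
                break
    return best_model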
