How to create a dynamic learning rate per neuron in PyTorch?

I know it's possible to have a learning rate per layer (link). I also found how to dynamically change the learning rate (changing it in the middle of training dynamically without a scheduler) (link).
How can I create an optimizer that has a dynamic learning rate per neuron, so that I can change the learning rate of specific neurons during training?
As an example, if my network is as follows:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(3,5)
        self.fc2 = nn.Linear(5,10)
        self.fc3 = nn.Linear(10,1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        return x
There should be 5 learning rates for the first layer (one for each of the 5 neurons, where each neuron has 3 associated weights), 10 for the second layer, and 1 for the last one.
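One possible way to get this behaviour, offered as a minimal sketch rather than a definitive answer: with plain SGD (no momentum, no weight decay), multiplying a neuron's gradient by a factor is equivalent to multiplying its learning rate by that factor. The helper `attach_neuron_lr` below is a hypothetical name, not a PyTorch API. It registers gradient hooks that scale each row of the weight matrix (one row per output neuron) and the matching bias entry. For adaptive optimizers such as Adam the equivalence does not hold, because they rescale the gradient themselves.

```python
import torch
import torch.nn as nn


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3, 5)
        self.fc2 = nn.Linear(5, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        return x


def attach_neuron_lr(layer):
    """Hypothetical helper: returns a per-neuron multiplier tensor.

    Editing the returned tensor in place during training changes the
    effective learning rate of the corresponding neuron (plain SGD only).
    """
    scale = torch.ones(layer.out_features)
    # weight has shape (out_features, in_features): one row per neuron
    layer.weight.register_hook(lambda grad: grad * scale.unsqueeze(1))
    layer.bias.register_hook(lambda grad: grad * scale)
    return scale


torch.manual_seed(0)
model = Model()
scales = {name: attach_neuron_lr(layer) for name, layer in model.named_children()}
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Change the multipliers at any point during training.
scales["fc1"][0] = 0.0   # freeze neuron 0 of the first layer
scales["fc1"][1] = 10.0  # neuron 1 learns with 10x the base rate

weight_before = model.fc1.weight.detach().clone()
bias_before = model.fc1.bias.detach().clone()

loss = model(torch.randn(8, 3)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

weight_after = model.fc1.weight.detach().clone()
bias_after = model.fc1.bias.detach().clone()
```

Here `scales` holds 5, 10 and 1 multipliers for `fc1`, `fc2` and `fc3`, matching the number of neurons per layer.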


Discriminator's loss stuck at value = 1 while training conditional GAN

I am training a conditional GAN that generates image time series (similar to video prediction). I built a conditional GAN based on this paper. However, several problems occurred when I was training the cGAN.
Problems of training cGAN:
The discriminator's loss is stuck at one.
It seems that the generator's loss is not affected by the discriminator, no matter how I adjust the hyperparameters related to the discriminator.
Training loss of discriminator
D_loss = (fake_D_loss + true_D_loss) / 2
fake_D_loss = Hinge_loss(D(G(x, z)))
true_D_loss = Hinge_loss(D(x, y))
The margin of hinge loss = 1
Training loss of generator
D_loss = -torch.mean(D(G(x,z)))
G_loss = weighted MAE
Gradient flow of discriminator
Gradient flow of generator
Several settings of the cGAN:
The output layer of discriminator is linear sum.
The discriminator is trained twice per epoch while the generator is only trained once.
The number of neurons of the generator and discriminator are exactly the same as the paper.
I replaced the ReLU (original setting) with LeakyReLU to avoid NaN values.
I added gradient norm to avoid gradient vanishing problem.
Other hyper parameters are listed as follows:
Hyper parameters
number of input images
number of predicted images
batch size
opt_g, opt_d
The loss function I use for discriminator.
def HingeLoss(pred, validity, margin=1.):
    if validity:
        # real samples: penalise scores below +margin
        loss = F.relu(margin - pred)
    else:
        # fake samples: penalise scores above -margin
        loss = F.relu(margin + pred)
    return loss.mean()
The loss function for examining the validity of predicted image from generator.
def HingeLossG(pred):
    return -torch.mean(pred)
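As a small worked example of the hinge loss above (a sketch with a hypothetical helper name, `hinge_loss_d`): if the discriminator outputs the same score s for real and fake inputs and |s| <= 1, the averaged loss is ((1 - s) + (1 + s)) / 2 = 1 regardless of s. A discriminator loss that sits at exactly 1 is therefore consistent with a discriminator that does not separate real from fake samples at all.

```python
import torch
import torch.nn.functional as F


def hinge_loss_d(pred, is_real, margin=1.0):
    # Same form as the discriminator hinge loss in the question.
    if is_real:
        return F.relu(margin - pred).mean()
    return F.relu(margin + pred).mean()


# Identical scores for real and fake inputs: the loss is always 1.
stuck_losses = []
for s in (0.0, 0.3, -0.7):
    scores = torch.full((8, 1), s)
    d_loss = (hinge_loss_d(scores, True) + hinge_loss_d(scores, False)) / 2
    stuck_losses.append(d_loss.item())

# Well separated scores: the loss drops to 0.
real_scores = torch.full((8, 1), 2.0)
fake_scores = torch.full((8, 1), -2.0)
separated_loss = (
    hinge_loss_d(real_scores, True) + hinge_loss_d(fake_scores, False)
) / 2
```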
I use the trainer of pytorch_lightning to train the model. The training codes I wrote are as follows.
def training_step(self, batch, batch_idx, optimizer_idx):
    x, y = batch
    x.requires_grad = True
    if self.n_sample > 1:
        # average several stochastic forward passes
        pred = [self(x) for _ in range(self.n_sample)]
        pred = torch.mean(torch.stack(pred, dim=0), dim=0)
    else:
        pred = self(x)
    # discriminator update
    if optimizer_idx == 1:
        true_D_loss = self.discriminator_loss(self.discriminator(x, y), True)
        fake_D_loss = self.discriminator_loss(self.discriminator(x, pred.detach()), False)
        D_loss = (fake_D_loss + true_D_loss) / 2
        return D_loss
    # generator update
    if optimizer_idx == 0:
        G_loss = self.generator_loss(pred, y)
        GD_loss = self.generator_d_loss(self.discriminator(x, pred.detach()))
        train_G_loss = G_loss + GD_loss
        return train_G_loss
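One thing that may be worth checking in the step above: in the generator branch the prediction is passed to the discriminator as `pred.detach()`. `detach()` cuts the tensor out of the autograd graph, so the adversarial term cannot send any gradient back to the generator. The sketch below uses two tiny stand-in networks (`G` and `D` are placeholders, not the model from the question) to show the effect in isolation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Linear(4, 4)  # stand-in generator
D = nn.Linear(4, 1)  # stand-in discriminator
x = torch.randn(8, 4)


def adversarial_grad_norm(detach):
    """Norm of the gradient that the adversarial loss sends to G."""
    G.zero_grad()
    pred = G(x)
    d_input = pred.detach() if detach else pred
    loss = -D(d_input).mean()
    loss.backward()
    grad = G.weight.grad
    return 0.0 if grad is None else grad.norm().item()


norm_with_detach = adversarial_grad_norm(detach=True)
norm_without_detach = adversarial_grad_norm(detach=False)
```

With `detach=True` the generator receives no gradient from the discriminator term. Whether this explains the behaviour in the question depends on the rest of the training code, which is not shown.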
I have several guesses of why these problems may occur:
Since the original model predicts 18 frames rather than 10 frames (my version), maybe the number of neurons in the original generator is too large for my case (predicting 10 frames), leading to an exceedingly powerful generator that breaks the balance of training. However, I've tried lowering the learning rate of the generator to 1e-5 (originally 5e-5) and increasing the number of discriminator updates to 3 to 5. The loss curve of the generator did not change much.
Various results of training cGAN
I have also adjusted the weights of the generator's loss, but the same problems still occurred.
The architecture codes of this model:

when setting .eval() my model performs worse than when I set .train()

During the training phase, I select the model parameters with the best performance metric.
if performance_metric.item()>max_performance:
max_performance= performance_metric.item(), PATH+'/')
This is the neural network model used:
class Neural_Net(nn.Module):
    def __init__(self, M,shape_input,batch_size):
        super(Neural_Net, self).__init__()
        self.lstm = nn.LSTM(shape_input,M)
        #self.dense1 = nn.Linear(shape_input,M)
        self.dense1 = nn.Linear(M,M) #Used with the LSTM
        self.dense2 = nn.Linear(M,M)
        self.dense3 = nn.Linear(M,1)
        self.drop = nn.Dropout(0.7)
        self.bachnorm1 = nn.BatchNorm1d(M)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        self.hidden_cell = (torch.zeros(1,batch_size,M),torch.zeros(1,batch_size,M))

    def forward(self, x):
        lstm_out, self.hidden_cell = self.lstm(x.view(1 ,len(x), -1), self.hidden_cell)
        x = self.drop(self.relu(self.dense1(self.bachnorm1(lstm_out.view(len(x), -1)))))
        x = self.drop(self.relu(self.dense2(x)))
        x = self.relu(self.dense3(x))
        return x
After that I load the model with the best parameters and set the evaluation mode:
The results are completely random. When I set train() the performance is similar to the selected best model parameter.
Is there an important aspect of eval() that I am forgetting? Is the batch normalization used correctly? I am using the same batch size for the test phase as for the training phase.
It is hard to say for certain without knowing your batch size, training/test dataset size, or the discrepancies between the training and test datasets, but this issue has been discussed on the PyTorch forums previously here.
In my experience, it sounds very much like the latent representation of your training data in your model is significantly different from that of your validation data. The main advice I can provide is to try reducing the momentum of your batchnorm layer. It might also be worth substituting a layernorm layer (which doesn't track a running mean/standard deviation) or setting track_running_stats=False in nn.BatchNorm1d and seeing if the problem persists.
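To illustrate the difference between the two modes, here is a minimal sketch on synthetic data (the input statistics are made up for the demonstration). In train mode `nn.BatchNorm1d` normalises with the statistics of the current batch; in eval mode it uses the running estimates, which after a single update are still close to their initial values (mean 0, variance 1). With `track_running_stats=False` the layer keeps using batch statistics in eval mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Synthetic batch whose mean (~10) and std (~5) are far from 0 and 1.
x = torch.randn(32, 4) * 5 + 10

bn = nn.BatchNorm1d(4)  # tracks running mean/var (default momentum=0.1)
bn_no_stats = nn.BatchNorm1d(4, track_running_stats=False)

bn.train()
out_train = bn(x)  # normalised with batch statistics

bn.eval()
out_eval = bn(x)   # normalised with the (barely updated) running statistics

bn_no_stats.eval()
out_no_stats = bn_no_stats(x)  # still uses batch statistics

print("train mode mean:", out_train.mean().item())
print("eval mode mean: ", out_eval.mean().item())
print("no running stats, eval mode mean:", out_no_stats.mean().item())
```

If the running estimates have not converged to the statistics of the data seen at test time, the eval-mode outputs can land in a very different range from the one the following layers were trained on.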

PyTorch: Predicting future values with LSTM

I'm currently working on building an LSTM model to forecast time-series data using PyTorch. I used lag features to pass the previous n steps as inputs to train the network. I split the data into three sets, i.e., train-validation-test split, and used the first two to train the model. My validation function takes the data from the validation data set and calculates the predicted values by passing it to the LSTM model using DataLoaders and TensorDataset classes. Initially, I've got pretty good results with R2 values in the region of 0.85-0.95.
However, I have an uneasy feeling about whether this validation function is also suitable for testing my model's performance, because the function takes the actual X values, i.e., time-lag features, from the DataLoader to predict y^ values, i.e., predicted target values, instead of using the predicted y^ values as features in the next prediction. This situation seems far from reality, where the model has no clue of the real values of the previous time steps, especially if you forecast time-series data for longer time periods, say 3-6 months.
I'm currently a bit puzzled about tackling this issue and defining a function to predict future values relying on the model's values rather than the actual values in the test set. I have the following function predict, which makes a one-step prediction, but I haven't really figured out how to predict the whole test dataset using DataLoader.
def predict(self, x):
    # convert row to data
    x =
    # make prediction
    yhat = self.model(x)
    # retrieve numpy array
    yhat =
    return yhat
You can find how I split and load my datasets, my constructor for the LSTM model, and the validation function below. If you need more information, please do not hesitate to reach out to me.
Splitting and Loading Datasets
def create_tensor_datasets(X_train_arr, X_val_arr, X_test_arr, y_train_arr, y_val_arr, y_test_arr):
    train_features = torch.Tensor(X_train_arr)
    train_targets = torch.Tensor(y_train_arr)
    val_features = torch.Tensor(X_val_arr)
    val_targets = torch.Tensor(y_val_arr)
    test_features = torch.Tensor(X_test_arr)
    test_targets = torch.Tensor(y_test_arr)
    train = TensorDataset(train_features, train_targets)
    val = TensorDataset(val_features, val_targets)
    test = TensorDataset(test_features, test_targets)
    return train, val, test

def load_tensor_datasets(train, val, test, batch_size=64, shuffle=False, drop_last=True):
    train_loader = DataLoader(train, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    val_loader = DataLoader(val, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    test_loader = DataLoader(test, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    return train_loader, val_loader, test_loader
Class LSTM
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout_prob):
        super(LSTMModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, layer_dim, batch_first=True, dropout=dropout_prob
        )
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, future=False):
        # fresh zero states for every batch
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))
        # keep only the output of the last time step
        out = out[:, -1, :]
        out = self.fc(out)
        return out
Validation (defined within a trainer class)
def validation(self, val_loader, batch_size, n_features):
    with torch.no_grad():
        predictions = []
        values = []
        for x_val, y_val in val_loader:
            x_val = x_val.view([batch_size, -1, n_features]).to(device)
            y_val =
            yhat = self.model(x_val)
    return predictions, values
I've finally found a way to forecast values based on predicted values from the earlier observations. As expected, the predictions were rather accurate in the short term and became slightly worse in the long term. It is not so surprising that the future predictions drift over time, as they no longer depend on the actual values. Reflecting on my results and the discussions I had on the topic, here are my take-aways:
In real-life cases, the real values can be retrieved and fed into the model at each step of the prediction (be it weekly, daily, or hourly) so that the next step can be predicted with the actual values from the previous step. So, testing the performance based on the actual values from the test set may somewhat reflect the real performance of the model that is maintained regularly.
However, for predicting future values in the long term, forecasting, if you will, you need to make either multiple one-step predictions or multi-step predictions that span over the time period you wish to forecast.
Making multiple one-step predictions based on the values predicted by the model yields plausible results in the short term. As the forecasting period increases, the predictions become less accurate and therefore less fit for the purpose of forecasting.
To make multiple one-step predictions and update the input after each prediction, we have to work our way through the dataset one by one, as if we are going through a for-loop over the test set. Not surprisingly, this makes us lose all the computational advantages that matrix operations and mini-batch training provide us.
An alternative could be predicting sequences of values, instead of predicting the next value only, say using RNNs with multi-dimensional output with a many-to-many or seq-to-seq structure. They are likely to be more difficult to train and less flexible when making predictions for different time periods. An encoder-decoder structure may prove useful for solving this, though I have not implemented it myself.
You can find the code for my function that forecasts the next n_steps based on the last row of the dataset X (time-lag features) and y (target value). To iterate over each row in my dataset, I would set batch_size to 1 and n_features to the number of lagged observations.
def forecast(self, X, y, batch_size=1, n_features=1, n_steps=100):
    predictions = []
    X = torch.roll(X, shifts=1, dims=2)
    X[..., -1, 0] = y.item(0)
    with torch.no_grad():
        for _ in range(n_steps):
            X = X.view([batch_size, -1, n_features]).to(device)
            yhat = self.model(X)
            yhat =
            X = torch.roll(X, shifts=1, dims=2)
            X[..., -1, 0] = yhat.item(0)
    return predictions
The following line shifts the values along the last dimension of the tensor (dims=2) by one, so that a tensor [[[x1, x2, x3, ... , xn ]]] becomes [[[xn, x1, x2, ... , x(n-1)]]].
X = torch.roll(X, shifts=1, dims=2)
And, the line below selects the first element from the last dimension of the 3d tensor and sets that item to the predicted value stored in the NumPy ndarray (yhat), [[xn+1]]. Then, the new input tensor becomes [[[x(n+1), x1, x2, ... , x(n-1)]]]
X[..., -1, 0] = yhat.item(0)
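The roll-and-overwrite step can be tried in isolation. The sketch below is a simplified, self-contained variant of the idea and not the trainer method above: `recursive_forecast` and `toy_model` are hypothetical names, and the toy model simply returns the mean of the lag values so that the numbers are easy to follow.

```python
import torch


def recursive_forecast(model, last_lags, n_steps):
    """Feed each prediction back in as the newest lag feature.

    last_lags: tensor of shape [1, 1, n_lags] (assumed layout).
    """
    X = last_lags.clone()
    predictions = []
    with torch.no_grad():
        for _ in range(n_steps):
            yhat = model(X)                      # expected shape [1, 1]
            predictions.append(yhat.item())
            X = torch.roll(X, shifts=1, dims=2)  # last lag wraps around to index 0
            X[..., -1, 0] = yhat.item()          # overwrite it with the new prediction
    return predictions


# Stand-in for a trained model: the mean of the lag features.
def toy_model(X):
    return X.mean(dim=2)


X0 = torch.tensor([[[1.0, 2.0, 3.0, 4.0]]])
rolled = torch.roll(X0, shifts=1, dims=2)  # [[[4., 1., 2., 3.]]]
preds = recursive_forecast(toy_model, X0, n_steps=3)
print(rolled)
print(preds)  # [2.5, 2.125, 1.90625]
```

Because every step consumes the previous prediction, the loop cannot be batched over time, which is the computational cost mentioned in the take-aways.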
Recently, I've decided to put together the things I had learned and the things I would have liked to know earlier. If you'd like to have a look, you can find the links down below. I hope you'll find it useful. Feel free to comment or reach out to me if you agree or disagree with any of the remarks I made above.
Building RNN, LSTM, and GRU for time series using PyTorch
Predicting future values with RNN, LSTM, and GRU using PyTorch

Weight initialization in neural networks

Hi I am developing a neural network model using keras.
def base_model():
    # Initialising the ANN
    regressor = Sequential()
    # Adding the input layer and the first hidden layer
    regressor.add(Dense(units = 4, kernel_initializer = 'he_normal', activation = 'relu', input_dim = 7))
    # Adding the second hidden layer
    regressor.add(Dense(units = 2, kernel_initializer = 'he_normal', activation = 'relu'))
    # Adding the output layer
    regressor.add(Dense(units = 1, kernel_initializer = 'he_normal'))
    # Compiling the ANN
    regressor.compile(optimizer = 'adam', loss = 'mse', metrics = ['mae'])
    return regressor
I have been reading about which kernel_initializer to use and came across the link-
it talks about Glorot and He initializations. I have tried different initializations for the weights, but all of them give the same results. I want to understand how important it is to do a proper initialization.
I'll explain why weight initialisation matters.
Suppose our NN has an input layer with 1000 neurons, and suppose we initialise the weights from a normal distribution with mean 0 and variance 1 ($w_i \sim \mathcal{N}(0, 1)$).
At the second layer, we assume that only 500 of the first layer's neurons are activated (output $a_i = 1$), while the other 500 are not ($a_i = 0$).
The input $z$ of a second-layer neuron will be the sum $z = \sum_i w_i a_i$ (ignoring the bias term),
so it will also be normally distributed, but with variance $\sigma_z^2 = 500$ (standard deviation $\approx 22.4$).
This means it is likely that $z \gg 1$ or $z \ll -1$, so neurons with a saturating activation such as the sigmoid will saturate. The network will learn very slowly, if at all.
A solution is to initialise the weights as $w_i \sim \mathcal{N}(0, 1/n_{in})$, where $n_{in}$ is the number of inputs of the first layer. In this way $z$ will have variance $500/1000 = 1/2$ and so be less spread out; consequently neurons are less prone to saturate.
This trick can help as a start, but in deep neural networks, due to the presence of multiple hidden layers, the weight initialisation should be done at each layer. One option is to use batch normalization.
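The variance argument can be checked numerically. The sketch below is written in PyTorch rather than Keras and rests on stated assumptions: 1000 inputs, of which 500 output exactly 1 and the rest 0, and no bias term. It samples many weight vectors and compares the spread of the pre-activation z under unit-variance weights and under weights with variance 1/n_in.

```python
import torch

torch.manual_seed(0)
n_in, n_active, trials = 1000, 500, 5000

# Assumed activations: the first 500 inputs are "on" (1), the rest "off" (0).
a = torch.zeros(n_in)
a[:n_active] = 1.0

# Weights ~ N(0, 1): z is a sum of 500 unit-variance terms.
w_unit = torch.randn(trials, n_in)
z_unit = w_unit @ a

# Weights ~ N(0, 1/n_in): each term has variance 1/1000.
w_scaled = torch.randn(trials, n_in) / n_in ** 0.5
z_scaled = w_scaled @ a

print("var(z), unit-variance weights:", z_unit.var().item())    # close to 500
print("var(z), 1/n_in weights:       ", z_scaled.var().item())  # close to 0.5
```

A pre-activation with a standard deviation above 20 pushes a sigmoid deep into its flat regions, while one with variance 0.5 mostly stays where the gradient is usable.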
Besides this, from your code I can see you've chosen MSE as the cost function, so it is a quadratic cost function. I don't know if your problem is a classification one, but if this is the case I suggest using a cross-entropy cost function instead, which can speed up the learning of your network.

MLP giving inaccurate results

I tried to build a simple MLP with 2 hidden layers and 3 output classes.
What I have done in the model is:
Input images are 120x120 rgb images. Flattened size (3 * 120 * 120)
2 hidden layers of size 100.
Relu activation is used
Output layer has 3 neurons
def model(input, weights, biases):
    l_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    l_1 = tf.nn.relu(l_1)
    l_2 = tf.add(tf.matmul(l_1, weights['h2']), biases['b2'])
    l_2 = tf.nn.relu(l_2)
    out = tf.matmul(l_2, weights['out']) + biases['out']
    return out
pred = model(input_batch, weights, biases)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.GradientDescentOptimizer(rate).minimize(cost)
The model however does not work. The accuracy is only equal to that of a random model.
The example followed is this one:
You have a copy-paste typo in def model. First argument name is input while it is x on the next line.
Another trick to use when you suspect that model is not being trained is to run it on the same batch again and again. If implementation is correct and model is being trained it will soon learn that batch by heart yielding 100% accuracy. If it does not then it is an indicator that something is wrong in your implementation.
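The single-batch check described above can look like the following. This is a sketch in PyTorch rather than the TensorFlow code from the question, with made-up data and sizes (16 random samples, 20 features, 3 classes) and Adam as the optimizer, all of which are assumptions for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One fixed batch of random data: there is nothing to generalise,
# so the only way to fit it is to memorise it.
x = torch.randn(16, 20)
y = torch.randint(0, 3, (16,))

model = nn.Sequential(
    nn.Linear(20, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Train on the same batch again and again.
for step in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

accuracy = (model(x).argmax(dim=1) == y).float().mean().item()
print("accuracy on the memorised batch:", accuracy)
```

If the accuracy on that one batch stays near chance, the problem is in the model or training code rather than in the data or the hyperparameters.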
