I am a deep learning beginner of about three months, working on small NLP projects with PyTorch.
Recently I have been trying to reproduce a GAN described in a paper, using my own text data, to generate some specific kinds of question sentences.
Here is some background... If you have no time or interest in it, feel free to skip straight to the question below.
As the paper describes, the generator is first trained normally on ordinary question data so that its output at least looks like a real question. Then, using an auxiliary classifier's results on those outputs, the generator is trained again to generate only the specific questions (a few particular categories).
However, since the paper does not release its code, I have to write everything myself. I have these three training ideas, but I do not know how they differ; could you kindly explain?
If they have almost the same effect, could you tell me which is more idiomatic in PyTorch? Thank you very much!
Suppose the discriminator's loss for the generator is loss_G_D, the classifier's loss for the generator is loss_G_C, and loss_G_D and loss_G_C have the same shape, i.e. [batch_size, loss value]. What is the difference between the following?
1.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_C = loss_func2(classifier(generated_data))
loss = loss_G_D + loss_G_C
loss.backward()
optimizer.step()
2.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()
optimizer.step()
3.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()
optimizer.step()
optimizer.zero_grad()
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()
optimizer.step()
Additional info: I observed that the classifier's classification loss is always very large compared with the generator's loss, e.g. -300 vs 3. So maybe the third option is better?
First of all:
loss.backward() backpropagates the error and assigns a gradient for every parameter along the way that has requires_grad=True.
optimizer.step() updates the model parameters using their stored gradients
optimizer.zero_grad() sets the gradients to 0, so that you can backpropagate your loss and update your model parameters for each batch without interfering with other batches.
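Put together, a single training step normally looks like this (a generic sketch; model, loss_fn, inputs and targets are placeholders, not names from your code):
optimizer.zero_grad()                    # clear gradients left over from the previous batch
loss = loss_fn(model(inputs), targets)   # forward pass
loss.backward()                          # fill .grad for every parameter with requires_grad=True
optimizer.step()                         # update the parameters from the stored gradients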
1 and 2 are quite similar, but if your model uses batch statistics or you have an adaptive optimizer, they could in principle behave differently. If your model doesn't use batch statistics and you have a plain old SGD optimizer, however, they will produce the same result, although 1 would be faster since you do the backward pass only once.
3 is a completely different case, since you update your model parameters with loss_G_D.backward() and optimizer.step() before processing and backpropagating loss_G_C.
Given all of these, it's up to you which one to choose depending on your application.
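Given the magnitude mismatch you mention (-300 vs 3), whichever variant you pick you may also want to weight the classifier term so it does not dominate. Here is a minimal sketch of option 1 with a hypothetical weighting coefficient lambda_c (my own addition, not from the paper):
lambda_c = 0.01  # hypothetical weight; tune it so the two terms have comparable size

optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_C = loss_func2(classifier(generated_data))
loss = loss_G_D + lambda_c * loss_G_C   # weighted sum instead of a plain sum
loss.backward()
optimizer.step()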
I have a training dataset which contains features of different sizes. I understand the implications of this in terms of network architecture and have designed my network accordingly to handle these heterogeneous shapes. When it comes to my training loop, though, I'm confused as to the order/placement of optimizer.zero_grad(), loss.backward(), and optimizer.step().
Because of the unequal feature sizes, I cannot do a forward pass on all the features of a batch at once. So my training loop iterates over the samples of a batch manually, like this:
for epoch in range(NUM_EPOCHS):
    for bidx, batch in enumerate(train_loader):
        optimizer.zero_grad()
        batch_loss = 0
        for sample in batch:
            feature1 = sample['feature1']
            feature2 = sample['feature2']
            label1 = sample['label1']
            label2 = sample['label2']
            pred_l1, pred_l2 = model(feature1, feature2)
            sample_loss = compute_loss(label1, pred_l1)
            sample_loss += compute_loss(label2, pred_l2)
            sample_loss.backward()  # CHOICE 1
            batch_loss += sample_loss.item()  # for CHOICE 2, accumulate the tensor instead: batch_loss = batch_loss + sample_loss
        # batch_loss.backward()  # CHOICE 2
        optimizer.step()
I'm wondering if it makes sense here that backward is called upon each sample_loss with the optimizer step called every BATCH_SIZE samples (CHOICE 1). The alternative, I think, would be to call backward upon batch_loss (CHOICE 2) and I'm not so sure which is the right choice.
Differentiation is a linear operation, so in theory it should not matter whether you first differentiate the different losses and add their derivatives or whether you first add the losses and then compute the derivative of their sum.
So for practical purposes both of them should lead to the same results (disregarding the usual floating-point issues).
You might see slightly different memory requirements and computation speeds (I'd guess the second version might be slightly faster), but that is hard to predict and something you can easily find out by timing the two versions.
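If you want to convince yourself, here is a tiny self-contained check (my own sketch, not from either training loop) that the two orderings accumulate identical gradients:
import torch

# Differentiation is linear: summing losses then calling backward once gives
# the same gradients as calling backward per loss and letting them accumulate.
w = torch.tensor([1.0, 2.0], requires_grad=True)

# CHOICE-2 style: add the losses first, then one backward pass.
((w ** 2).sum() + (3 * w).sum()).backward()
grad_summed = w.grad.clone()

# CHOICE-1 style: backward each loss separately; gradients accumulate in w.grad.
w.grad.zero_()
(w ** 2).sum().backward()
(3 * w).sum().backward()
grad_accumulated = w.grad.clone()

print(torch.allclose(grad_summed, grad_accumulated))  # True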
When I previously did classification work with TextCNN, I fine-tuned it using pretrained word embeddings such as Word2Vec and fastText, following this process (a minimal sketch of this setup is shown after the list):
1. Create an embedding layer in TextCNN.
2. Load the embedding matrix for the words used in this task from Word2Vec or fastText.
3. Since the vector values of the embedding layer change during training, the network is being fine-tuned.
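For concreteness, steps 1 and 2 look roughly like this (a minimal sketch; embedding_matrix is a hypothetical (vocab_size, EMBD_DIM) array built from the pretrained Word2Vec/fastText vectors, not a variable from my actual code):
import torch
import torch.nn as nn

# embedding_matrix: hypothetical (vocab_size, EMBD_DIM) array of pretrained vectors
weights = torch.tensor(embedding_matrix, dtype=torch.float)
# freeze=False keeps the embedding weights trainable, so they are fine-tuned
# together with the rest of TextCNN.
embedding = nn.Embedding.from_pretrained(weights, freeze=False)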
Recently I also wanted to try BERT for this. I thought, 'There should be little difference between using BERT's pretrained embeddings to initialize another network's embedding layer and fine-tuning it, so it should be easy!' But in fact I tried all day yesterday and still could not do it.
What I found is that, because BERT's embeddings are contextual, the vector for a given word varies from sentence to sentence when the word embeddings are extracted, so it seems there is no way to use those embeddings to initialize another network's embedding layer in the usual way...
Finally, I came up with one method to 'fine-tune', with the following steps:
1. Do not define an embedding layer in TextCNN.
2. Instead of using an embedding layer, in the training loop I first pass the sequence tokens to the pretrained BERT model and get the word embeddings for each sentence.
3. Feed the BERT word embeddings from step 2 into TextCNN and train the TextCNN network.
Using this method I was finally able to train, but thinking about it seriously, I don't think I'm doing any fine-tuning at all...
Because, as you can see, every time I start a new training loop the word embeddings generated by BERT are always the same vectors, so just feeding these unchanged vectors into TextCNN would not be fine-tuning at all, right?
UPDATE:
I thought up a new method to use the BERT embeddings and 'train' BERT and textcnn together.
Some part of my code is:
BERTmodel = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM, KERNEL_SIZES).to(device)
optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
loss_func = nn.CrossEntropyLoss()

for epoch in range(EPOCH):
    TextCNNmodel.train()
    BERTmodel.train()
    for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
        token_batch = token_batch.to(device)
        y_batch = y_batch.to(device)
        BERToutputs = BERTmodel(token_batch)
        # I want to use the second-to-last hidden layer as the embedding, so
        x_batch = BERToutputs[2][-2]
        output = TextCNNmodel(x_batch)
        output = output.squeeze()
        loss = loss_func(output, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
I thought that by enabling BERTmodel.train() and removing torch.no_grad() when getting the embeddings, the loss gradients could be backpropagated to BERTmodel, too. The training of TextCNNmodel also went smoothly.
To use this model later, I saved the parameters of both TextCNNmodel and BERTmodel.
Then, to check whether the BERTmodel was really being trained and changed, in another program I loaded my saved BERTmodel and fed it a sentence.
However, I found that the output (the embedding) of the original 'bert-base-uncased' model and my 'BERTmodel' were the same, which is disappointing...
I really have no idea why the BERTmodel part did not change...
Here I would like to thank @Jindřich for giving me the important hint!
My updated code above was almost there, but I forgot to set an optimizer for BERTmodel.
After I set the optimizer and ran the training again, this time when I loaded my BERTmodel I found that the output (the embedding) of the original 'bert-base-uncased' model and my 'BERTmodel' were finally different, which means the BERT model has changed and should have been fine-tuned.
Here is my final code; I hope it helps you, too.
BERTmodel = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM, KERNEL_SIZES).to(device)
optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
optimizer_bert = torch.optim.AdamW(BERTmodel.parameters(), lr=2e-5, weight_decay=1e-2)
loss_func = nn.CrossEntropyLoss()

for epoch in range(EPOCH):
    TextCNNmodel.train()
    BERTmodel.train()
    for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
        token_batch = token_batch.to(device)
        y_batch = y_batch.to(device)
        BERToutputs = BERTmodel(token_batch)
        # I want to use the second-to-last hidden layer as the embedding, so
        x_batch = BERToutputs[2][-2]
        output = TextCNNmodel(x_batch)
        output = output.squeeze()
        loss = loss_func(output, y_batch)
        optimizer.zero_grad()
        optimizer_bert.zero_grad()
        loss.backward()
        optimizer.step()
        optimizer_bert.step()
I will continue my experiments to see if my BERTmodel is really being finetuned.
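As a side note (my own variation, not from the paper), the same joint fine-tuning could also be done with a single optimizer by passing the two models' parameters as separate parameter groups; note that this would put the TextCNN parameters under AdamW as well:
optimizer = torch.optim.AdamW([
    {'params': TextCNNmodel.parameters(), 'lr': LR, 'weight_decay': 0.0},
    {'params': BERTmodel.parameters(), 'lr': 2e-5, 'weight_decay': 1e-2},
])
# ...then a single zero_grad()/backward()/step() in the loop updates both models.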
I'm very new to deep learning models and am trying to train a multiple-time-series model using an LSTM with Keras Sequential. There are 25 observations per year for 50 years = 1250 samples, so I'm not sure whether it is even possible to use an LSTM for such small data. However, I have thousands of feature variables, not including time lags. I'm trying to predict a sequence of the next 25 time steps of data. The data is normalized between 0 and 1. My problem is that, despite trying many obvious adjustments, I cannot get the LSTM validation loss anywhere close to the training loss (it is overfitting dramatically, I think).
I have tried adjusting the number of nodes per hidden layer (25-375), the number of hidden layers (1-3), dropout (0.2-0.8), batch_size (25-375), and the train/test split (from 90%/10% to 50%/50%). Nothing really makes much of a difference in the validation loss/training loss disparity.
# SPLIT INTO TRAIN AND TEST SETS
# 25 observations per year; Allocate 5 years (2014-2018) for Testing
n_test = 5 * 25
test = values[:n_test, :]
train = values[n_test:, :]
# split into input and outputs
train_X, train_y = train[:, :-25], train[:, -25:]
test_X, test_y = test[:, :-25], test[:, -25:]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 5, newdf.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 5, newdf.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# design network
model = Sequential()
model.add(Masking(mask_value=-99, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(LSTM(375, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(125, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(25))
model.add(Dense(25))
model.compile(loss='mse', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=20, batch_size=25, validation_data=(test_X, test_y), verbose=2, shuffle=False)
Epoch 19/20
14s - loss: 0.0512 - val_loss: 188.9568
Epoch 20/20
14s - loss: 0.0510 - val_loss: 188.9537
I assume I must be doing something obviously wrong, but I can't see it since I'm a newbie. I am hoping either to get a useful validation loss (relative to the training loss), or to learn that my data simply has too few observations for useful LSTM modeling. Any help or suggestions are much appreciated, thanks!
Overfitting
In general, if you're seeing much higher validation loss than training loss, then it's a sign that your model is overfitting - it learns "superstitions" i.e. patterns that accidentally happened to be true in your training data but don't have a basis in reality, and thus aren't true in your validation data.
It's generally a sign that you have a "too powerful" model, too many parameters that are capable of memorizing the limited amount of training data. In your particular model you're trying to learn almost a million parameters (try printing model.summary()) from a thousand datapoints - that's not reasonable, learning can extract/compress information from data, not create it out of thin air.
What's the expected result?
The first question you should ask (and answer!) before building a model is about the expected accuracy. You should have a reasonable lower bound (what's a trivial baseline? For time series prediction, e.g. linear regression might be one) and an upper bound (what could an expert human predict given the same input data and nothing else?).
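For your setup, such a lower bound is cheap to compute. Here is a rough sketch (my own suggestion, reusing the train_X/train_y/test_X/test_y arrays from your code; with thousands of features a regularised model such as Ridge may be more practical than plain linear regression):
import numpy as np
from sklearn.linear_model import LinearRegression

# Baseline 1: always predict the per-step mean of the training targets.
mean_pred = train_y.mean(axis=0, keepdims=True)                  # shape (1, 25)
print('mean baseline MSE:', np.mean((test_y - mean_pred) ** 2))

# Baseline 2: ordinary linear regression on the flattened inputs.
linreg = LinearRegression().fit(train_X.reshape(len(train_X), -1), train_y)
linreg_pred = linreg.predict(test_X.reshape(len(test_X), -1))
print('linear regression MSE:', np.mean((test_y - linreg_pred) ** 2))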
Much depends on the nature of the problem. You really have to ask: is this information sufficient to get a good answer? For many real-life time series prediction problems, the answer is no - the future state of such a system depends on many variables that can't be determined by simply looking at historical measurements; to reasonably predict the next value, you need to bring in lots of external data beyond those measurements. There's a classic quote by Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."
I have a dataset C of 50,000 (binary) samples, each with 128 features. The class label is also binary, either 1 or -1. For instance, a sample would look like this: [1,0,0,0,1,0, .... , 0,1] [-1]. My goal is to classify the samples based on the binary classes (i.e., 1 or -1). I thought I would try a recurrent LSTM to build a good classification model. To do so, I have written the following code using the Keras library:
tr_C, ts_C, tr_r, ts_r = train_test_split(C, r, train_size=.8)
batch_size = 200
print('>>> Build STATEFUL model...')
model = Sequential()
model.add(LSTM(128, batch_input_shape=(batch_size, C.shape[1], C.shape[2]), return_sequences=False, stateful=True))
model.add(Dense(1, activation='softmax'))
print('>>> Training...')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(tr_C, tr_r,
          batch_size=batch_size, epochs=1, shuffle=True,
          validation_data=(ts_C, ts_r))
However, I am getting bad accuracy, no more than 55%. I tried changing the activation function along with the loss function, hoping to improve the accuracy, but nothing worked. Surprisingly, when I use a multilayer perceptron, I get very good accuracy, around 97%. So I am starting to question whether an LSTM can be used for this kind of classification, or whether my code is missing something or is wrong. I would like to know whether the code has something missing or wrong that I can fix to improve the accuracy. Any help or suggestion is appreciated.
You cannot use softmax as the output activation when you have only a single output unit, as it will always output a constant value of 1. You need to either change the output activation to sigmoid or set the number of output units to 2 and the loss to categorical_crossentropy. I would advise the first option.
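A minimal sketch of the first option applied to the code above (the label remapping is my addition: binary_crossentropy expects 0/1 targets, while the labels here are -1/1):
# Map labels from {-1, 1} to {0, 1} for binary_crossentropy.
tr_r = (tr_r + 1) / 2
ts_r = (ts_r + 1) / 2

# Single sigmoid output unit instead of softmax.
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])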
I'm a newbie to machine learning, and this is one of the first real-world ML tasks I have attempted.
Some experimental data contains 512 independent boolean features and a boolean result.
There are about 1e6 real experiment records in the provided data set.
In the classic XOR example, all 4 out of 4 possible states are required to train the NN. In my case, the roughly 1e6 ≈ 2^20 samples cover only about 2^(20-512) = 2^-492 of the 2^512 possible states, which is essentially zero.
I have no more information about the data nature, just these (512 + 1) * 1e6 bits.
I tried an NN with 1 hidden layer on the available data. The output of the trained NN, even on samples from the training set, is always close to 0; not a single output is close to 1. I played with the weight initialization and the gradient descent learning rate.
My code uses TensorFlow 1.3 and Python 3. Model excerpt:
with tf.name_scope("Layer1"):
#W1 = tf.Variable(tf.random_uniform([512, innerN], minval=-2/512, maxval=2/512), name="Weights_1")
W1 = tf.Variable(tf.zeros([512, innerN]), name="Weights_1")
b1 = tf.Variable(tf.zeros([1]), name="Bias_1")
Out1 = tf.sigmoid( tf.matmul(x, W1) + b1)
with tf.name_scope("Layer2"):
W2 = tf.Variable(tf.random_uniform([innerN, 1], minval=-2/512, maxval=2/512), name="Weights_2")
#W2 = tf.Variable(tf.zeros([innerN, 1]), name="Weights_2")
b2 = tf.Variable(tf.zeros([1]), name="Bias_2")
y = tf.nn.sigmoid( tf.matmul(Out1, W2) + b2)
with tf.name_scope("Training"):
y_ = tf.placeholder(tf.float32, [None,1])
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels = y_, logits = y)
)
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)
with tf.name_scope("Testing"):
# Test trained model
correct_prediction = tf.equal( tf.round(y), tf.round(y_))
# ...
# Train
for step in range(500):
batch_xs, batch_ys = Datasets.train.next_batch(300, shuffle=False)
_, my_y, summary = sess.run([train_step, y, merged_summaries],
feed_dict={x: batch_xs, y_: batch_ys})
I suspect two cases:
my fault – bad NN implementation, wrong architecture;
bad data. Compared to the XOR example, incomplete training data would result in a failing NN. However, the training examples fed to the trained NN are supposed to give the right predictions, aren't they?
How can I evaluate whether it is possible at all to train a neural network (a 2-layer perceptron) on the provided data to forecast the result? An acceptable dataset would be something like the XOR example, as opposed to random noise.
There are only ad hoc ways to know if it is possible to learn a function with a differentiable network from a dataset. That said, these ad hoc ways do usually work. For example, the network should be able to overfit the training set without any regularisation.
A common technique to gauge this is to only fit the network on a subset of the full dataset. Check that the network can overfit to that, then increase the size of the subset, and increase the size of the network as well. Unfortunately, deciding whether to add extra layers or add more units in a hidden layer is an arbitrary decision you'll have to make.
However, looking at your code, there are a few things that could be going wrong here:
Are your outputs balanced? By that I mean, do you have the same number of 1s as 0s in the dataset targets?
Your initialisation in the first layer is all zeros, the gradient to this will be zero, so it can't learn anything (although, you have a real initialisation above it commented out).
Sigmoid nonlinearities are more difficult to optimise than simpler nonlinearities, such as ReLUs.
I'd recommend using the built-in definitions for layers in Tensorflow to not worry about initialisation, and switching to ReLUs in any hidden layers (you need sigmoid at the output for your boolean target).
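A rough sketch of those suggestions in TensorFlow 1.x (the variable names follow your code; innerN is the hidden width, and the exact sizes are just placeholders):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 512])
y_ = tf.placeholder(tf.float32, [None, 1])

# Built-in dense layers handle weight initialisation (Glorot uniform by default).
hidden = tf.layers.dense(x, innerN, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 1, activation=None)
y = tf.nn.sigmoid(logits)   # probability of the positive class

# Compute the loss from the raw logits with the binary (sigmoid) cross entropy.
cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)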
Finally, deep learning isn't actually very good at most "bag of features" machine learning problems because they lack structure. For example, the order of the features doesn't matter. Other methods often work better, but if you really want to use deep learning then you could look at this recent paper, showing improved performance by just using a very specific nonlinearity and weight initialisation (change 4 lines in your code above).