I am working on implementing a Generative Adversarial Network (GAN) in PyTorch 1.5.0.
For computing the loss of the generator, I compute both the negative probabilities that the discriminator mis-classifies an all-real minibatch and an all-(generator-generated-)fake minibatch. Then, I back-propagate both parts sequentially and finally apply the step function.
Calculating and back-propagating the part of the loss that depends on the mis-classification of the generated fake data seems straightforward, since during back-propagation of that loss term the backward path leads through the generator that produced the fake data in the first place.
However, classifying all-real minibatches does not involve passing data through the generator. Therefore, I was wondering whether the following code snippet would still calculate gradients for the generator, or whether it would not calculate any gradients at all (since the backward path does not lead through the generator and the discriminator is in eval mode while the generator is updated).
# Update generator #
net.generator.train()
net.discriminator.eval()
net.generator.zero_grad()
# All-real minibatch
x_real = get_all_real_minibatch()
y_true = torch.full((batch_size,), label_fake).long() # Pretend true targets were fake
y_pred = net.discriminator(x_real) # Produces softmax probability distribution over (0=label_fake,1=label_real)
loss_real = NLLLoss(torch.log(y_pred), y_true)
loss_real.backward()
optimizer_generator.step()
If this doesn’t work as intended, how could I make it work? Thanks in advance!
No gradients are propagated to the generator, as no calculation was performed with any of the generator's parameters. The discriminator being in eval mode would not prevent the gradients from propagating to the generator, albeit they would be slightly different if you are using layers that behave differently in eval mode compared to train mode, such as dropout.
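This can be checked directly. The following is a minimal sketch, assuming two tiny stand-in networks (the `generator` and `discriminator` modules, their sizes, and the random data are hypothetical and are not the asker's `net`). It compares the generator's `.grad` attributes after back-propagating a loss on an all-real minibatch with those after a loss on a generated minibatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy networks standing in for net.generator / net.discriminator.
generator = nn.Linear(4, 8)
discriminator = nn.Sequential(nn.Linear(8, 2), nn.LogSoftmax(dim=1))

batch_size, label_fake, label_real = 16, 0, 1

# Loss on an all-real minibatch: the generator is never called,
# so it is not part of the autograd graph of this loss.
x_real = torch.randn(batch_size, 8)
y_fake = torch.full((batch_size,), label_fake, dtype=torch.long)
loss_real = F.nll_loss(discriminator(x_real), y_fake)
loss_real.backward()
real_grads = [p.grad for p in generator.parameters()]  # every entry is None

# Loss on a generated minibatch: the backward path runs through the generator.
z = torch.randn(batch_size, 4)
y_real = torch.full((batch_size,), label_real, dtype=torch.long)
loss_fake = F.nll_loss(discriminator(generator(z)), y_real)
loss_fake.backward()
fake_grads = [p.grad for p in generator.parameters()]  # tensors, not None
```

Because the generator's gradients are still `None` after the first `backward()`, an optimizer step at that point would leave the generator's parameters unchanged.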
The misclassification of real images is not part of training the generator, because it doesn't gain anything from this information. Conceptually, what should the generator learn from the fact that the discriminator failed to correctly classify a real image? The sole task of the generator is to create a fake image such that the discriminator thinks it's real, therefore the only relevant information for the generator is whether the discriminator was able to identify the fake image. If the discriminator was indeed able to identify the fake image, the generator needs to adjust itself to create a more convincing fake.
Of course it's not a binary case, but the generator always tries to improve the fake image such that the discriminator is even more convinced that it was a real image. The generator's goal is not to make the discriminator be doubtful (probability of 0.5 that it's real or fake), but that the discriminator is fully convinced that it's real, even though it's fake. That's why they are adversarial, not cooperative.
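The same conclusion can be read off the original GAN objective (Goodfellow et al., arXiv:1406.2661). The notation below, with $\theta_g$ for the generator's parameters, is an assumption made for this sketch:

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G_{\theta_g}(z))\big)\big],
\qquad
\nabla_{\theta_g}\, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] = 0 .
```

The first term contains no $\theta_g$, so its gradient with respect to the generator is zero; only the second term, which passes through $G$, can change the generator. In practice the generator is often trained to maximize $\log D(G(z))$ instead, which corresponds to labelling the fake batch as real.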
I'm currently coding a Generative Adversarial Network (GAN) from scratch with my own neural network library to generate MNIST handwritten digits. The discriminator seems to work fine, but the generator doesn't really learn anything over time. Maybe my training approach is wrong.
So my question is whether I can actually train my generator this way.
First I train my discriminator with real examples and the target output 1, and then with fake examples produced by the generator and the target output 0. This works fine.
Next I train the generator by running the discriminator on fake examples, but with the target output 1 (the generator wants the discriminator to classify its generated images as real). I backpropagate the error all the way back to the input layer of the discriminator, but without updating the discriminator's weights. I then backpropagate this input-layer error through the generator and update the generator based on it.
Can I actually do that and backpropagate the error of the discriminator through the generator? The generator's output is essentially the input to the discriminator, right? Or is there a better way to do it?
Any help is appreciated.
From your question, I assume you are proposing an approach like this: while training the discriminator, you want to backpropagate all the way into the generator (to the point where the noise is provided) instead of detaching at the first layer of the discriminator?
If this is the case, then you are updating the generator's parameters with respect to the discriminator's loss. The job of the discriminator is to update its parameters so that it can distinguish real from fake. If you don't stop the backpropagation and let it continue into the generator, the generator's parameters get updated with respect to the discriminator loss, which pushes the generator to produce images that the discriminator can easily distinguish. This creates a mess: you are training the generator to fool the discriminator and, at the same time, the generator is being fooled by the discriminator.
The approach is simply:
1. Generate an image with the generator.
2. Pass the real image to the discriminator with target 1.
3. Pass the fake image to the discriminator with target 0 (or vice versa).
4. Perform backprop, making sure to detach the fake image (fake.detach() in PyTorch). This halts backprop at that point, so the generator's parameters are not updated.
5. Then train the generator by passing the fake image through the discriminator with target 1 (or 0 if you swapped the targets above).
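The steps above can be sketched as one training iteration in PyTorch. This is only an illustration under stated assumptions: the toy models `G` and `D`, the layer sizes, the learning rate, and the random "real" batch are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy models; real GANs use much larger networks.
G = nn.Sequential(nn.Linear(4, 8), nn.Tanh())
D = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

batch_size = 16
real = torch.randn(batch_size, 8)  # placeholder for a minibatch of real data
ones = torch.ones(batch_size, 1)
zeros = torch.zeros(batch_size, 1)

# Discriminator step: real -> 1, fake -> 0, with the fake batch detached
# so that backprop stops before it reaches the generator.
fake = G(torch.randn(batch_size, 4))
opt_d.zero_grad()
loss_d = bce(D(real), ones) + bce(D(fake.detach()), zeros)
loss_d.backward()
g_grads_after_d_step = [p.grad for p in G.parameters()]  # still None
opt_d.step()

# Generator step: the same fake batch, labelled 1 and NOT detached.
# D also accumulates gradients here, but only opt_g.step() is called,
# and opt_d.zero_grad() clears them at the start of the next iteration.
opt_g.zero_grad()
loss_g = bce(D(fake), ones)
loss_g.backward()
g_grads_after_g_step = [p.grad for p in G.parameters()]  # now populated
opt_g.step()
```

This matches the asker's description: the error is propagated back through the discriminator to its input and then on through the generator, while only the generator's optimizer takes a step.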
GANs do take a lot of time to train. For the best training results, follow the hacks collected at https://github.com/soumith/ganhacks
I implemented the proposed GAN Model from the Paper Edge-Connect (https://github.com/knazeri/edge-connect) in Keras and did some trainings on the KITTI dataset. Now I am trying to figure out what's going on inside my model and therefore I have a few questions.
1. Initial Training (100 Epochs, 500 batches/epoch, 10 Samples/Batch)
At first I trained the model as proposed in the paper (including style, perceptual, L1, and adversarial loss).
At first sight, the model converges to nice results:
This is the output of the generator(left) for the masked input(right)
Most of the graphs from the tensorboard look quite good as well:
(These are all values from the GAN model, containing the total loss of the generator (GENERATOR_Loss), the different losses based on the generated image (L1, perc, style), as well as the adversarial loss (DISCRIMINATOR_loss).)
Looking closely at the discriminator, things look different. The adversarial loss of the discriminator for the generated images steadily increases.
The loss while training the discriminator (50/50 fake/real examples) doesn't change at all:
And when looking at the histogram of activations of the output of the discriminator it always outputs values around 0.5.
Coming to my questions/conclusions where I would appreciate your feedback:
So I assume now that my model learned a lot, but nothing from the discriminator, right? The results are all based on the losses other than the adversarial loss?
It seems that the discriminator could not keep up with the generator producing better images. I think the discriminator's activations should move early on to two peaks at around 0 (fake labels) and 1 (real labels) and stay there?
I know that my final goal is that the discriminator outputs 0.5 probability for real as well as fake... but what does it mean when this happens right from the beginning and doesn't change during training?
Did I stop training too early? Could the discriminator catch up (since the output of the generator doesn't change much anymore) and eliminate the last tiny faults of the generator?
2. Thus I started a second training, this time only using the adversarial loss in the generator! (~16 Epochs, 500 batches/epoch, 10 Samples/Batch)
This time the discriminator seems to be able to differentiate between real and fake after a while.
(prob_real is the mean probability assigned to real images and vice versa)
The histogram of activations looks good as well:
But somehow after around 4k Samples things start to change and at around 7k it diverges...
Also all samples from the generator look like this:
Coming to my second part of questions/conclusions:
Should I pretrain the discriminator so it gets a head start? I guess it needs to be able to differentiate between real and fake (outputting large probabilities for real and vice versa) so that the generator can learn useful things from it? Should I train the discriminator multiple times per generator step for the same reason?
What happened in the second training? Was the learning rate for the discriminator too high? (Opt: ADAM, lr=1.0E-3)
Many hints on the internet for training GANs aim at increasing the difficulty of the discriminator's job (label noise/label flipping, instance noise, label smoothing, etc.). Here I think the discriminator rather needs to be boosted? (I also trained the discriminator without changing the generator, and it converges nicely.)
If the discriminator outputs a probability of 0.5 right from the beginning of training, it means that the weights of the discriminator are not being updated and it plays no role in training, which further indicates that it is not able to differentiate between real images and fake images coming from the generator. To address this, try adding Gaussian noise to the discriminator's input or using label smoothing; both are simple and effective techniques.
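As a rough illustration of the two techniques, here is a framework-free sketch. The helper names, the one-sided smoothing variant (only the real targets are softened), the target value 0.9, and the noise level 0.1 are all assumptions chosen for the example, not values taken from the question.

```python
import random

def smooth_real_labels(labels, real_target=0.9):
    """One-sided label smoothing: real targets (1) become real_target, fake targets stay 0.0."""
    return [real_target if y == 1 else 0.0 for y in labels]

def add_instance_noise(batch, sigma=0.1, rng=random):
    """Add zero-mean Gaussian noise with standard deviation sigma to every input value."""
    return [[v + rng.gauss(0.0, sigma) for v in sample] for sample in batch]

rng = random.Random(0)
labels = [1, 1, 0, 0]                                       # 1 = real, 0 = fake
batch = [[0.0, 0.5], [1.0, -1.0], [0.2, 0.3], [-0.4, 0.9]]  # toy discriminator inputs

smoothed = smooth_real_labels(labels)
noisy = add_instance_noise(batch, sigma=0.1, rng=rng)
```

In a Keras or PyTorch training loop the same idea applies to tensors: the noise is added to both the real and the generated batches before they enter the discriminator, and the smoothed targets replace the hard 0/1 labels in the discriminator loss.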
Regarding your question of whether the results are all based on the losses other than the adversarial loss: one trick is to first train the network on all losses except the adversarial loss and then fine-tune with the adversarial loss. Hope it helps.
For the second part of your questions, the generated images seem to suffer from mode collapse, where the generator learns the colors and degradation of one image and transfers them to the other images. Try to address it by either decreasing the batch size or using unrolled GANs.
I am trying to understand how a GAN is trained. I believe I understand the adversarial training process. What I can't seem to find information on is this: do GANs use class labels in the training process? My current understanding says no, because the discriminator is simply trying to discriminate between real and fake images, while the generator is trying to create realistic images (but not images of any specific class).
If this is the case, then how do researchers propose to use the discriminator network for classification tasks? The network would only be able to perform two-way classification between real and fake images. The generator network would also be difficult to use, seeing as we don't know what setting of the input vector 'Z' will result in the required generated image.
It completely depends on the network you are trying to build. If you are talking specifically about the basic GAN, then you are correct: class labels are not needed, as the discriminator network is only classifying real/fake images. There is a conditional variant of the GAN (cGAN) where you do make use of the class labels in both the generator and the discriminator. This allows you to produce examples for a specific class with the generator and classify them with the discriminator (along with the real/fake classification).
From the reading that I have done, the discriminator network is just used as a tool for training the generator, and the generator is the main network of concern. Why would you use the discriminator that you used to train the GAN for classification when you could just use a ResNet or VGG net for your classification tasks? These networks would work better anyway. You are right, however, that using the original GAN could cause difficulty because of mode collapse, where the generator constantly produces the same image. That is why the conditional variant was introduced.
Hope this clears things up!
Do GANs use class labels in the training process?
The author suspected that GANs don't require labels. This is correct. The discriminator is trained to classify real and fake images. Since we know which images are real and which are generated by the generator, we do not need labels to train the discriminator. The generator is trained to fool the discriminator, which also doesn't require labels.
This is one of the most attractive benefits of GANs [1]. Usually, we refer to methods that do not require labels as unsupervised learning. That said, if we had labels, maybe we could train a GAN that uses the labels to improve performance. This idea underlies the follow-up work by [2] who introduced the conditional GAN.
If this is the case, then how do researchers propose to use the discriminator network for classification tasks?
There seems to be a misunderstanding here. The purpose of the discriminator is NOT to act as a classifier on real data. The purpose of the discriminator is to "tell the generator how to improve its fakes". This is done by using the discriminator as a loss function, which we can backpropagate gradients through if it is a neural network. After training, we usually discard the discriminator.
The generator network would also be difficult to use, seeing as we don't know what setting of the input vector 'Z' will result in the required generated image.
It seems the underlying reason for posting the question lies here. The input vector 'Z' is chosen such that it follows some distribution, typically a normal distribution. But then what happens if we take 'Z', a random vector with normally distributed entries, and compute 'G(Z)'? We get a new vector which follows a very complicated distribution that depends on G. The entire idea of GANs is to change G such that this new complicated distribution is close to the distribution of our data. This idea is formalized with f-divergences in [3].
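A small numerical sketch can make this concrete. Here `G` is a hypothetical fixed, one-dimensional "generator" (just the exponential function, chosen only because its output distribution is known in closed form); a trained GAN generator is a neural network, but the principle is the same: samples of a simple distribution are mapped to samples of a different one.

```python
import math
import random
import statistics

def G(z):
    """Hypothetical fixed 'generator': any nonlinear map illustrates the point."""
    return math.exp(z)

rng = random.Random(0)
z_samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # Z ~ N(0, 1)
x_samples = [G(z) for z in z_samples]                       # G(Z) is log-normal

z_mean = statistics.fmean(z_samples)     # close to 0
x_mean = statistics.fmean(x_samples)     # close to exp(0.5)
x_median = statistics.median(x_samples)  # close to 1
```

The inputs are symmetric around 0, while the outputs are strictly positive and skewed (mean near e^0.5, median near 1). Changing `G` changes the output distribution, and GAN training adjusts `G` until that distribution resembles the data.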
[1] https://arxiv.org/abs/1406.2661
[2] https://arxiv.org/abs/1411.1784
[3] https://arxiv.org/abs/1606.00709
I have this 5-5-2 backpropagation neural network I'm training, and after reading this awesome article by LeCun I started to put in practice some of the ideas he suggests.
Currently I'm evaluating it with a 10-fold cross-validation algorithm I made myself, which goes basically like this:
for each epoch
    for each possible split (training, validation)
        train and validate
    end
    compute mean MSE between all k splits
end
My inputs and outputs are standardized (zero mean, unit variance) and I'm using a tanh activation function. All network algorithms seem to work properly: I used the same implementation to approximate the sin function and it does that pretty well.
Now, the question is as the title implies: should I standardize each train/validation set separately or do I simply need to standardize the whole dataset once?
Note that if I do the latter, the network doesn't produce meaningful predictions, but I prefer having a more "theoretical" answer than just looking at the outputs.
By the way, I implemented it in C, but I'm also comfortable with C++.
You will most likely be better off standardizing each training set individually. The purpose of cross-validation is to get a sense for how well your algorithm generalizes. When you apply your network to new inputs, the inputs will not be ones that were used to compute your standardization parameters. If you standardize the entire data set at once, you are ignoring the possibility that a new input will fall outside the range of values over which you standardized.
So unless you plan to re-standardize every time you process a new input (which I'm guessing is unlikely), you should only compute the standardization parameters for the training set of the partition being evaluated. Furthermore, you should compute those parameters only on the training set of the partition, not the validation set (i.e., each of the 10-fold partitions will use 90% of the data to calculate standardization parameters).
So you assume the inputs are normally distributed and are subtracting the mean and dividing by the standard deviation to get N(0,1) distributed inputs?
Yes, I agree with @bogatron that you should standardize each training set separately, but I would say more strongly that it's a "must" not to use the validation set data too. The problem is not values outside the range of the training set; this is fine, since the transformation to a standard normal is still defined for any value. You can't compute the mean and standard deviation over all the data because you can't in any way use the validation data in training, even if just via this statistic.
It should further be emphasized that you use the mean from the training set with the validation set, not the mean from the validation set. It has to be the same transformation of features that was used during training. It would not be valid to transform the validation set differently.
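The procedure described in these answers can be sketched as follows. The question's implementation is in C; this illustration uses Python with a single toy feature, and the helper names `fit_standardizer`, `standardize`, and `k_fold_indices` are hypothetical. For several input features the same steps would be applied to each feature separately.

```python
import statistics

def fit_standardizer(values):
    """Compute mean and standard deviation from the training values only."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply a previously fitted transformation; no statistics are recomputed here."""
    return [(v - mean) / std for v in values]

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds (assumes n is divisible by k)."""
    fold = n // k
    for i in range(k):
        val = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j < i * fold or j >= (i + 1) * fold]
        yield train, val

data = [float(v) for v in range(1, 21)]  # toy 1-D feature with 20 samples
folds = []
for train_idx, val_idx in k_fold_indices(len(data), k=10):
    train = [data[j] for j in train_idx]
    val = [data[j] for j in val_idx]
    mean, std = fit_standardizer(train)   # statistics from the training fold only
    train_z = standardize(train, mean, std)
    val_z = standardize(val, mean, std)   # the same transformation is reused
    folds.append((train_z, val_z))
```

Each training fold ends up with zero mean and unit variance, while the validation fold generally does not. That is expected: the validation data plays the role of unseen inputs and is transformed exactly as new inputs would be at prediction time.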
Is anyone here familiar with echo state networks? I created an echo state network in C#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that an echo state network may not be the best choice for this classification, but I have to do it with this method.
My problem is that after training the network, it cannot generalize. When I run the network with foreign data (not the teaching input), I get only around 50-60% correct results.
More details: my echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (I have to classify the input as bad or good).
So I have created a network. It contains an input layer with 17 neurons, a reservoir layer whose neuron count is adjustable, and an output layer containing 1 neuron for the required output of 0 or 1. In the simpler setup, no output feedback is used (I tried output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. If the values are too big, it normalizes the matrix to have a spectral radius lower than 1. The reservoir layer can use sigmoid or tanh activation functions.
The input layer is fully connected to the reservoir layer with random values. In the training stage I calculate the inner reservoir activations X(n) with the training data and collect them row-wise into a matrix. Using the desired output data matrix (which is now a vector of 1 or 0 values), I calculate the output weights (from reservoir to output). The reservoir is fully connected to the output. Anyone who has used echo state networks knows what I'm talking about. I use the pseudo-inverse method for this.
The question is: how can I adjust the network so that it generalizes better, hitting more than 50-60% of the desired outputs on a foreign dataset (not the training one)? If I run the network again with the training dataset, it gives very good results, 80-90%, but what I want is better generalization.
I hope someone had this issue too with echo state networks.
If I understand correctly, you have a set of known, classified data that you train on, then you have some unknown data which you subsequently classify. You find that after training, you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting - you might want to think about being less stringent with your network, reducing node number, and/or training based on a hidden dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters: early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
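The three steps above can be sketched with ridge regression and a simple grid search. This is a pure-Python illustration under stated assumptions: the random matrix `X` merely stands in for the collected reservoir states, the helper names and the candidate grid are hypothetical, and for brevity the constant bias column is regularized along with the other weights.

```python
import random

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

def ridge_fit(X, y, lam):
    """Ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(A, b)

def mse(X, y, w):
    """Mean squared error of the linear readout w on (X, y)."""
    return sum((sum(a * c for a, c in zip(row, w)) - t) ** 2
               for row, t in zip(X, y)) / len(y)

# Toy stand-in for reservoir states (5 features plus a constant bias column) and 0/1 targets.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] + [1.0] for _ in range(100)]
y = [1.0 if row[0] + 0.5 * row[1] + rng.gauss(0, 0.5) > 0 else 0.0 for row in X]

split = int(0.8 * len(X))  # 80% "training" part, 20% "validation" part
X_train, y_train = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

# Grid search: fit on the training part, score on the validation part.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: mse(X_val, y_val, ridge_fit(X_train, y_train, lam)) for lam in candidates}
best_lam = min(scores, key=scores.get)
w_best = ridge_fit(X_train, y_train, best_lam)
```

With `lam = 0.0` this reduces to ordinary least squares, which is what the pseudo-inverse computes when the state matrix has full column rank; larger values shrink the output weights. In an ESN, the reservoir would be run once to collect the states, and only this readout fit would be repeated for each candidate value.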
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).