I have built my own neural network in Java.
Its neuron layers follow this structure: 784 (inputs), 200, 80, 10 (outputs).
I fed it the MNIST training data in 300 batches of 100 randomly selected images, updating the weights and biases after each batch, with a learning rate of 0.005. However, the network seems to adopt the strategy of outputting all zeros every time, because
{0,0,0,0,0,0,0,0,0,0} is much closer to {0,1,0,0,0,0,0,0,0,0} than any actual guessing strategy it has tried. On occasion it attempts to change, but it can never find a strategy that works better than just outputting zero for everything. Can anyone tell me how to fix this? Does it need more training data? Does this mean there's an error in the backpropagation function I wrote?
Thanks for any suggestions!
Make sure your data labels are integer encoded.
Make sure the final Dense layer has 10 units with a softmax activation function.
Compile your model with sparse_categorical_crossentropy as the loss function.
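In Keras terms, that recipe would look roughly like the sketch below. This is only a reference sketch, not the asker's hand-rolled Java network; the 200/80 layer sizes are taken from the question, while the optimizer, epoch count, and input scaling are assumptions.

import tensorflow as tf

# Minimal sketch: integer-encoded labels 0-9, softmax output, sparse categorical cross-entropy.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(200, activation='relu'),
    tf.keras.layers.Dense(80, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),   # 10 units, one per digit class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # expects integer labels, not one-hot
              metrics=['accuracy'])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, batch_size=100, epochs=5)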
I implemented the GAN model proposed in the paper Edge-Connect (https://github.com/knazeri/edge-connect) in Keras and did some training on the KITTI dataset. Now I am trying to figure out what is going on inside my model, and therefore I have a few questions.
1. Initial Training (100 Epochs, 500 batches/epoch, 10 Samples/Batch)
At first I trained the model as proposed in the paper (including the style, perceptual, L1, and adversarial losses).
At first sight, the model converges to nice results:
This is the output of the generator (left) for the masked input (right):
Most of the graphs from the tensorboard look quite good as well:
(These are all values from the GAN model, containing the total loss of the generator (GENERATOR_Loss), the different losses based on the generated image (L1, perc, style), as well as the adversarial loss (DISCRIMINATOR_loss).)
When looking closely at the discriminator, things look different. The adversarial loss of the discriminator on the generated images steadily increases.
The loss while training the discriminator (50/50 fake/real examples) doesn't change at all:
![](https://i.stack.imgur.com/o5jCA.png)
And when looking at the histogram of the discriminator's output activations, it always outputs values around 0.5.
Coming to my questions/conclusions where I would appreciate your feedback:
So I assume that my model learned a lot, but nothing from the discriminator, right? The results are all based on the losses other than the adversarial loss?
It seems that the discriminator could not keep up with the generator producing better and better images. I would think the discriminator's activations should move early on to two peaks, at around 0 (fake labels) and 1 (real labels), and stay there?
I know that my final goal is for the discriminator to output a probability of 0.5 for real as well as fake images... but what does it mean when this happens right from the beginning and never changes during training?
Did I stop training too early? Could the discriminator catch up (since the output of the generator doesn't change much anymore) and eliminate the last tiny faults of the generator?
2. Thus I started a second training, this time only using the adversarial loss in the generator! (~16 Epochs, 500 batches/epoch, 10 Samples/Batch)
This time the discriminator seems to be able to differentiate between real and fake after a while.
(prob_real is the mean probability assigned to real images and vice versa)
The histogram of activations looks good as well:
But somehow after around 4k Samples things start to change and at around 7k it diverges...
Also all samples from the generator look like this:
Coming to my second part of questions/conclusions:
Should I pretrain the discriminator so it gets a head start? I guess it needs to be able to differentiate between real and fake (outputting large probabilities for real images and vice versa) so that the generator can learn useful things from it. Should I train the discriminator for multiple steps per generator step for the same reason?
What happened in the second training? Was the learning rate of the discriminator too high? (Optimizer: Adam, lr = 1.0E-3)
Many hints on the internet for training GANs aim at making the discriminator's job harder (label noise/label flipping, instance noise, label smoothing, etc.). Here I think the discriminator rather needs to be boosted? (I also trained the discriminator without changing the generator, and it converges nicely.)
If the discriminator outputs a probability of 0.5 right from the beginning, it means that the weights of the discriminator are not being updated in any useful way and it plays no role in training, which further indicates that it is not able to differentiate between the real images and the fake images coming from the generator. To solve this, try adding Gaussian noise to the discriminator's input or applying label smoothing; both are very simple and effective techniques.
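As a rough sketch of those two tricks in Keras (illustrative only: the input shape, filter counts, and layer stack below are made up and are not taken from the Edge-Connect discriminator):

import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(256, 256, 3))      # hypothetical input shape
x = layers.GaussianNoise(0.1)(inp)           # instance noise added to the discriminator input
x = layers.Conv2D(64, 4, strides=2, padding='same')(x)
x = layers.LeakyReLU(0.2)(x)
x = layers.Flatten()(x)
out = layers.Dense(1, activation='sigmoid')(x)
discriminator = tf.keras.Model(inp, out)

# Label smoothing: with label_smoothing=0.1, targets of 1.0 become 0.95 and 0.0 becomes 0.05.
discriminator.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                      loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=0.1))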
Regarding your question about whether the results are all based on the losses other than the adversarial loss: one trick that can be used is to first train the network on all the losses except the adversarial loss and then fine-tune with the adversarial loss included. Hope it helps.
For the second part of your questions, the generated images seem to suffer from mode collapse, where the generator learns the color and degradation from one image and applies the same to the other images. Try to address it by either decreasing the batch size or using unrolled GANs.
I am trying to implement a neural network for classification with 5 hidden layers and softmax cross-entropy in the output layer. The implementation is in Java.
For optimization, I have used mini-batch gradient descent (batch size = 100, learning rate = 0.01).
However, after a couple of iterations, the weights become NaN and the predicted values turn out to be the same for every test case.
I have been unable to debug the source of this error.
Here is the GitHub link to the code (with the test/training files):
https://github.com/ahana204/NeuralNetworks
In my case, I forgot to normalize the training data (by subtracting the mean). This was causing the denominator of my softmax equation to be 0. Hope this helps.
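For what it's worth, a common complementary safeguard is to make the softmax itself numerically stable by subtracting the maximum logit before exponentiating. A generic NumPy sketch (not taken from the linked Java code):

import numpy as np

def stable_softmax(logits):
    # Subtracting the max logit leaves the result mathematically unchanged,
    # but keeps exp() from overflowing and the denominator from degenerating.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))  # no overflow, no NaN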
Assuming the code you implemented is correct, one reason could be a large learning rate. If the learning rate is too large, the weights may not converge and may become very small or very large, which can show up as NaN. Try lowering the learning rate to see if anything changes.
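To illustrate that failure mode with a toy example (purely hypothetical, not the asker's network): plain gradient descent on f(w) = w² diverges once the step is too large, and the iterates eventually overflow to inf and then NaN.

def descend(lr, steps, w=1.0):
    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    # The update is w <- (1 - 2*lr) * w, so if |1 - 2*lr| > 1 the iterates explode.
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(descend(lr=0.01, steps=2000))  # shrinks towards 0
print(descend(lr=1.5, steps=2000))   # factor is -2 each step: overflows to inf, then nan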
I am trying to design a neural network that makes a custom binary prediction.
Normally, to do binary prediction, I would use a softmax as my last layer, and then my loss would be the difference between the prediction I made and the true binary value.
However, what if I don't want to use a softmax layer? Instead, I output a real-valued number and check whether some condition on this number is true. In a really simple case, I check whether this number is positive: if it is, I predict 1, otherwise I predict 0. Let's say I want all the numbers to be positive, so the true predictions should all be 1, and I want to train this network so that it outputs all positive numbers. I am confused about how to formulate a loss function for this problem so that I am able to backpropagate and train the network.
Does anyone have an idea how to create this kind of network?
> I am confused about how to formulate a loss function for this problem so that I am able to backpropagate and train the network.
Here's how you should approach it. Effectively, you need to transform the labels into positive and negative target values (say +1 and -1) and solve a regression problem. The loss function can be a simple L1 or L2 loss. The network will learn to output a prediction close to the training target, which you can afterwards interpret by checking which target it is closer to, i.e. positive or negative. You can even make some targets larger (e.g. +2 or +10) to emphasize that those examples are very important. Example code: linear regression in tensorflow.
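A minimal sketch of that idea in TensorFlow/Keras (the data, layer sizes, and optimizer below are illustrative assumptions, not part of the original answer):

import numpy as np
import tensorflow as tf

# Map binary labels {0, 1} to regression targets {-1.0, +1.0}.
y_binary = np.array([0, 1, 1, 0])
y_target = 2.0 * y_binary - 1.0

x = np.random.randn(4, 8).astype('float32')    # toy features

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1),                  # single real-valued output, no activation
])
model.compile(optimizer='adam', loss='mse')    # simple L2 loss
model.fit(x, y_target, epochs=10, verbose=0)

# At prediction time, threshold at zero: positive -> class 1, negative -> class 0.
pred_class = (model.predict(x) > 0).astype(int)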
However, I have to warn you that this approach has serious drawbacks; see for instance this question. A single outlier in the training data can easily skew your predictions. Classification with softmax + cross-entropy loss is more stable, which is why it is almost always a better choice.
I tried batch normalization of the LSTM weights as per https://arxiv.org/abs/1603.09025 on a convolutional-RNN-based network, and I got a notable improvement in training speed and performance. The features extracted by the CNN are fed into 2 layers of bidirectional LSTM.
In my first network I used few feature maps, so the input size to the LSTM layers was 128. However, when I increase the input size (e.g. 256), I start getting NaNs in the LSTM output after some iterations (it works fine without batch normalization). I understand that this might be related to division by small numbers. I also used an epsilon of 10^-6, but I am still getting NaNs.
Any ideas on what I can do to get rid of the NaNs? Thanks.
For those who are having the same problem: using the float64 data type instead of float32 helped solve this issue. Of course this has memory implications, but I found it to be the only solution so far.
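The question doesn't say which framework was used; assuming a TensorFlow/Keras setup, the float64 switch (plus a slightly larger epsilon) would look roughly like this. The recurrent batch normalization from the paper is a custom layer, so the standard BatchNormalization below is only a stand-in showing where the dtype and epsilon settings would go.

import tensorflow as tf

# Run the whole model in float64 instead of float32 (costs memory and speed).
tf.keras.backend.set_floatx('float64')

# A larger epsilon also guards against division by near-zero variances.
bn = tf.keras.layers.BatchNormalization(epsilon=1e-5)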
I have a 1-D time series classification problem, and I have imported the data into Torch. I have written two different networks to learn the data. Each row is to be labelled as either 1 or 0.
The problem is that the loss of the convolutional network does not fall after the first iteration; it stays at exactly the same value after iteration one. This is not true of the other network, a logistic regression, whose loss does fall over time.
Below is the ConvNet:
require 'nn'

model = nn.Sequential()
for i = 1, iteration do
    -- inputFrameSize = 1, outputFrameSize = 1, kernel width = 3, stride = 1
    model:add(nn.TemporalConvolution(1, 1, 3, 1))
    model:add(nn.BatchNormalization(1))
    model:add(nn.ReLU())
    -- pooling window = 3, stride = 2
    model:add(nn.TemporalMaxPooling(3, 2))
    if i == iteration then
        -- squash the final output to (0, 1) for the binary label
        model:add(nn.Sigmoid())
    end
end
Since the LogReg's loss does fall, I assume the problem is to do with the ConvNet itself, rather than anything else in the code.
Any advice would be much appreciated. I am happy to post more code if required.
Usually, if there is no improvement when minimizing the loss function, the model is already at a local (or global) minimum.
This can have several reasons, e.g. the learning rate, regularization, or data that does not fit the model in some way.
It's hard to say based on the model alone.
Did you use the exact same training code for the LogReg?
You can check out this tutorial for some information about TemporalConvolution:
http://supercomputingblog.com/machinelearning/an-intro-to-convolutional-networks-in-torch/