I tried batch normalization for the LSTM weights, as per https://arxiv.org/abs/1603.09025, on a convolutional-RNN network and got a notable improvement in training speed and performance. The features extracted by the CNN are fed into two layers of bidirectional LSTM.
In my first network I used few feature maps, so the input to the LSTM layers was 128. However, when I increase the input size (e.g. to 256), I start getting NaNs in the LSTM output after some iterations (it works fine without batch normalization). I suspect this is related to division by small numbers. I also tried an epsilon of 10^-6, but I am still getting NaNs.
Any ideas on what I can do to get rid of the NaNs? Thanks.
For those who are having the same problem: using the float64 data type instead of float32 solves this issue. Of course this has memory implications, but I found it to be the only solution so far.
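A minimal sketch of that dtype switch, assuming a tf.keras setup (the recurrent batch normalization from the paper is not a built-in Keras layer, so only the precision change is shown here):

```python
import tensorflow as tf

# Switch Keras' global default float type to float64. This roughly
# doubles memory use but gives the normalization statistics more
# headroom before they underflow toward zero.
tf.keras.backend.set_floatx('float64')

# The epsilon added to the variance can also be raised; epsilon is a
# real argument of this layer, though the value that helps will vary.
bn = tf.keras.layers.BatchNormalization(epsilon=1e-5)
```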
I have built my own neural network in Java.
Its neuron layers follow this structure: 784 (inputs), 200, 80, 10 (outputs).
I fed it the MNIST training data in 300 batches of 100 randomly selected images, updating weights and biases after each batch, with a learning rate of 0.005. However, the network seems to adopt a strategy of giving an output of all zeros every time, because just saying
{0,0,0,0,0,0,0,0,0,0} is much closer to {0,1,0,0,0,0,0,0,0,0} than any actual guessing strategy it has tried. On occasion it will attempt to change, but it can never find a strategy that works better than just saying zero for everything. Can anyone tell me how to fix this? Does it need more training data? Does this mean there's an error in the backpropagation function I wrote?
Thanks for any suggestions!
Make sure your data labels are integer encoded.
Make sure the final Dense layer has 10 units with a softmax activation function.
Compile your model with sparse_categorical_crossentropy as the loss function.
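A minimal Keras sketch of the setup these three points describe (the layer sizes are illustrative, mirroring the 784-200-80-10 structure from the question):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(200, activation='relu', input_shape=(784,)),
    keras.layers.Dense(80, activation='relu'),
    # Final layer: 10 units with softmax, one per digit class
    keras.layers.Dense(10, activation='softmax'),
])

# Integer-encoded labels (0-9) pair with sparse_categorical_crossentropy
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```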
My dataset consists of massive vectors. The data points are mostly zeros, with only ~3% of the features being 1. Essentially, my data is super sparse, and when I train an autoencoder on it, the model learns just to recreate vectors of all zeros.
Are there any techniques to prevent this? I have tried replacing mean squared error with Dice loss, but then the model completely stopped learning. My other thought is to use a loss function that favors guessing 1s correctly rather than 0s. I have also tried sigmoid and linear final activations, with no clear winner. Any ideas would be awesome.
It seems like you are facing a severe "class imbalance" problem.
Have a look at focal loss. This loss is designed for binary classification with severe class imbalance.
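A minimal TensorFlow sketch of binary focal loss (gamma = 2 and alpha = 0.25 are the common defaults from the focal-loss paper, not values from this answer):

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t: the predicted probability of the true class
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        # alpha_t balances the classes; (1 - p_t)^gamma down-weights easy examples
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -tf.reduce_mean(alpha_t * (1.0 - p_t) ** gamma * tf.math.log(p_t))
    return loss
```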
Consider "hard negative mining": that is, propagate gradients only for part of the training examples - the "hard" ones.
See, e.g.:
Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick, "Training Region-based Object Detectors with Online Hard Example Mining" (CVPR 2016).
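A rough TensorFlow sketch of the batch-level variant, online hard example mining (keep_fraction is an assumed hyper-parameter, and binary cross entropy is my stand-in for the reconstruction loss):

```python
import tensorflow as tf

def ohem_binary_loss(keep_fraction=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        # Per-example losses over the batch
        per_example = tf.reshape(
            tf.keras.losses.binary_crossentropy(y_true, y_pred), [-1])
        n = tf.shape(per_example)[0]
        k = tf.maximum(1, tf.cast(tf.cast(n, tf.float32) * keep_fraction, tf.int32))
        # Keep only the hardest (largest-loss) examples for the gradient
        hard, _ = tf.math.top_k(per_example, k=k)
        return tf.reduce_mean(hard)
    return loss
```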
I was using a Keras CNN to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is that?
Using batch size 1000 (acc = 0.97600)
Using batch size 10 (acc = 0.97599)
Although the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the effect of mini-batch gradient descent during training. You can find a good explanation here; I quote some notes from that link below:
Batch size is a slider on the learning process. Small values give a learning process that converges quickly, at the cost of noise in the training process. Large values give a learning process that converges slowly, with accurate estimates of the error gradient.
and one more important note from that link:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.
This is the result reported in this paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you generally should not expect them (deep learning algorithms in particular) to produce the same results on different runs. You can find more details here.
On the other hand, your two results are so close that they are, for practical purposes, equal. So in your case, based on the reported results, we can say that the batch size has no effect on your network's accuracy.
This is not connected to Keras. The batch size and the learning rate are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD); they strongly affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them in the direction of the negative gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimate, which helps regularize the model, and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
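To make the noise argument concrete, here is a toy NumPy sketch of one epoch of mini-batch SGD (the linear model and squared error are my choices for illustration; the batch size directly controls how noisy each gradient estimate is):

```python
import numpy as np

def sgd_epoch(w, X, y, lr=0.01, batch_size=32):
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # Gradient of mean squared error, estimated on this mini-batch only
        pred = X[batch] @ w
        grad = 2.0 * X[batch].T @ (pred - y[batch]) / len(batch)
        # Step in the direction of the negative gradient
        w -= lr * grad
    return w
```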
I want to add two points:
1) With special treatment, it is possible to achieve similar performance with a very large batch size while speeding up the training process tremendously. For example, see
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
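One key ingredient of that paper is the linear learning-rate scaling rule; a minimal sketch (the reference values are the ones the paper uses for ResNet-50 on ImageNet):

```python
# Linear scaling rule (Goyal et al., 2017): when the mini-batch size is
# multiplied by k, multiply the learning rate by k as well.
base_lr, base_batch = 0.1, 256  # reference values from the paper
batch_size = 8192               # an example very large mini-batch
lr = base_lr * batch_size / base_batch  # -> 3.2
```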
2) Regarding your MNIST example, I really don't suggest you over-read these numbers. The difference is so subtle that it could easily be caused by noise. I bet if you evaluate models saved at a different epoch, you will see a different result.
I am trying to implement a neural network for classification with 5 hidden layers and softmax cross entropy in the output layer. The implementation is in Java.
For optimization, I used mini-batch gradient descent (batch size = 100, learning rate = 0.01).
However, after a couple of iterations, the weights become NaN and the predicted values turn out to be the same for every test case.
I am unable to debug the source of this error.
Here is the GitHub link to the code (with the test/training files):
https://github.com/ahana204/NeuralNetworks
In my case, I forgot to normalize the training data (by subtracting the mean). This was causing the denominator of my softmax equation to be 0. Hope this helps.
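A related, common guard is the max-subtraction trick for softmax; a minimal NumPy sketch (this is a standard numerical-stability fix, not code from this answer, but it addresses the same zero-denominator failure):

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the row-wise max before exponentiating keeps exp()
    # from overflowing, and guarantees the denominator is at least 1.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)
```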
Assuming the code you implemented is correct, one reason would be a large learning rate. If the learning rate is large, the weights may not converge and may become very small or very large, which can show up as NaN. Try lowering the learning rate to see if anything changes.
I am trying to build a prediction model. Initially I used a variational autoencoder and reduced the features from 2100 to 64.
Now, having (5000 x 64) samples for training and (2000 x 64) for testing, I tried to build a fully connected feed-forward (MLP) network, but the mean absolute error plateaus at 161 and won't go down. I tried varying all the hyper-parameters and also the hidden layers, but to no avail.
Can anyone suggest what the reason might be and how I can overcome this problem?
First of all, training a neural network can be a bit tricky. The performance of the network after training (and the training process itself) depends on a large number of factors. Secondly, you have to be more specific about your dataset (rather, the problem) in your question.
Just by looking at your question, what can be said is ...
What is the range of values in your data? A mean absolute error of 161 is quite high; it seems like you have large values in your data. (Try normalizing the data, i.e. subtract the mean and divide by the standard deviation of each of your features/variables.)
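A minimal NumPy sketch of that normalization (fitting the statistics on the training set only is my assumption of the standard practice):

```python
import numpy as np

def standardize(train, test):
    # Compute statistics on the training data only, then apply to both
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8  # guard against constant features
    return (train - mean) / std, (test - mean) / std
```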
How did you initialize the weights of your network? Training performance depends very much on the initial weight values; a bad initialization can lead to a poor local minimum. (Try initializing with Glorot's initialization method.)
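For reference, a Keras sketch with Glorot initialization made explicit (the layer sizes are hypothetical; Dense layers in Keras actually default to glorot_uniform already):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu',
                       kernel_initializer='glorot_uniform',
                       input_shape=(64,)),  # the 64 VAE features as input
    keras.layers.Dense(1, kernel_initializer='glorot_uniform'),
])
```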
You have reduced the dimensionality from 2100 to 64. Isn't this too much? (It might actually be okay, but it really depends on your data.)