TensorFlow on GPU uses too much memory

I'm trying to train a very basic, small LSTM model in Tensorflow on a GTX 1080 (although I've tried other cards, too). Depending on some parameters (like the hidden state size), I get a ResourceExhaustedError after a pretty regular number of iterations.
The model isn't much more than an embedding matrix (ca. 5000*300) and a single-layer LSTM, with a final dense projection at every timestep. I have tried batch sizes as small as 1 and a hidden state size of just 20, but I still run out of memory on an 8 GB card, with a total training data size of 5M.
I can't wrap my head around why this is happening. I've obviously tried the suggestions from related problems discussed on Stack Overflow, incl. reducing per_process_gpu_memory_fraction in the TF GPU options, but to no avail.
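For reference, the GPU-options tweak I tried looks roughly like this (the 0.5 fraction is just a placeholder value):
import tensorflow as tf

# cap TF's GPU memory allocation (TF 1.x style); allow_growth is the other commonly tried knob
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5, allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))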
See code here:
https://pastebin.com/1MSUrt15 (main training script)
https://pastebin.com/U1tYbM8A (model definition)
[This doesn't include some utility scripts, so won't run alone. I also deleted some functions for the sake of shortness. The code is designed for multi-task learning, which introduces some overhead here, but the memory problems persist in single-task setups.]
PS: one thing I know I'm not doing 100% efficiently is storing all training data as a numpy array, then sampling from there and using TF's feed_dict to provide the data to my model. This, I believe, can slow down computation to some degree, but it shouldn't cause such severe memory issues, right?
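For what it's worth, the tf.data alternative to that feed_dict setup would look roughly like this (array and batch-size names are placeholders):
import tensorflow as tf

# build an input pipeline directly from the in-memory numpy arrays
dataset = tf.data.Dataset.from_tensor_slices((train_x, train_y))
dataset = dataset.shuffle(10000).batch(batch_size).repeat()
next_batch = dataset.make_one_shot_iterator().get_next()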

Related

How to estimate a CoreML model's maximal runtime footprint (in megabytes)

Let's say I have a network model made in TensorFlow/Keras/Caffe etc.
I can use CoreML Converters API to get a CoreML model file (.mlmodel) from it.
Now, as I have a .mlmodel file, and know input shape and output shape, how can a maximum RAM footprint be estimated?
I know that a model can have a lot of layers, and their size can be much bigger than the input/output shape.
So the questions are:
Can a maximal mlmodel memory footprint be determined with some formula/API, without compiling and running an app?
Is the maximal footprint closer to the memory size of the biggest intermediate layer, or is it closer to the sum of all layers' sizes?
Any advice is appreciated.
As I am new to CoreML, you may give any feedback and I'll try to improve the question if needed.
IMHO, whatever formula you come up with at the end of the day must be based on the number of trainable parameters of the network.
For classification networks it can be found by iterating over the layers, or the existing API can be used.
In Keras:
import keras.applications.resnet50 as resnet
# weights=None builds the architecture without loading pretrained weights
model = resnet.ResNet50(include_top=True, weights=None, input_tensor=None, input_shape=None, pooling=None, classes=2)
model.summary()
Total params: 23,591,810
Trainable params: 23,538,690
Non-trainable params: 53,120
PyTorch:
def count_parameters(model):
    # count only the parameters that the optimizer will actually update
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
For the detectors, you probably need to do the same for all the important parts of the network, including the backbone, rpn, etc., whatever your network consists of.
The second important parameter is the precision of the network. You must have heard about quantization. It changes the precision of the floats for all or some layers, and can be static (the network is trained in the desired precision and calibrated) or dynamic (the network is converted after training). The simplest dynamic quantization replaces the floats with some kind of ints on linear layers. Mask R-CNN in PyTorch gives a roughly 30% smaller file size and a substantial reduction in memory consumption with the same number of trainable parameters.
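To illustrate, the simplest dynamic quantization in PyTorch is roughly a one-liner like this (a sketch, not CoreML-specific; model is whatever network you already have):
import torch

# weights of linear layers are stored as int8; activations are quantized on the fly
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)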
So the final equation is like size = number_of_trainable_parameters * precision * X, where X is some factor you have to find out for your particular network and coreml specifics )
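As a back-of-the-envelope illustration of the first two factors, using the ResNet50 counts above (X is left out, so this is just the raw weight storage, not the full runtime footprint):
trainable_params = 23_538_690
print(trainable_params * 4 / 1e6)   # float32 precision: ~94 MB of weights
print(trainable_params * 1 / 1e6)   # int8 after quantization: ~24 MB of weights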
I wrote a blog post a few years ago that goes into some of this: https://machinethink.net/blog/how-fast-is-my-model/
However, keep in mind that Core ML's actual behavior is not known. It will most likely try to be as efficient as possible (i.e. reuse the memory for tensors that are no longer needed) but it's a black box, so who knows. The only way to find out is to try your models on an actual device.

Autoencoder Failing to Capture Small Artifacts

tl;dr - I use an autoencoder to try to reduce input dimensions for a reinforcement-learning (RL) agent to learn how to play Atari-KungFu. But it fails at encoding/decoding thrown knives, because they are only a couple pixels and getting them wrong probably has negligible impact on the autoencoder MSE loss (see green arrows in bottom left of image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this -- preferably with a generalized solution, but I'd be happy for now with something specific to this problem.
Background:
I am working on Week5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to expand performance on the Atari-KungFu assignment using Actor-Critic architecture. This post is not about actor-critic, but more about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a tanh-64-neuron layer, which is used as a common input to the decoder, policy learner (actor), and value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and rewards from the last action. Then images are first used to train the autoencoder, then used with the rewards to train the actor & critic branches.
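For context, a simplified sketch of that kind of shared encoder (tf.keras; the conv layer sizes here are illustrative placeholders, not my exact code):
from tensorflow.keras import layers, models

frames = layers.Input(shape=(64, 144, 4))                 # four stacked frames
x = layers.Conv2D(32, 4, strides=2, activation='relu', padding='same')(frames)
x = layers.Conv2D(64, 4, strides=2, activation='relu', padding='same')(x)
x = layers.Flatten()(x)
bottleneck = layers.Dense(64, activation='tanh')(x)       # shared by decoder, actor and critic
encoder = models.Model(frames, bottleneck)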
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom-left). Arguably this is because missing those couple pixels barely increases the MSE loss of the reconstructed image, so it has little incentive to learn them (and also there aren't many frames that contain knives). Yet, seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. Some things that would probably help the autoencoder reconstruct the knives as well:
Find the knives in the input image using image-processing techniques, and give the regions where knives are present a higher weight in the MSE loss, say 10 times more (see the sketch below). One way to find those regions semi-automatically could be a convolution with a big kernel: white pixels at the exact center add weight, and having only zeros around them adds weight as well. Something along these lines should find regions where only knives are located (the throwing guys wouldn't match, as they contain too many white pixels and holes). Thresholding the kernel response at a value found empirically should be enough to locate them correctly.
Lower the loss for images where no knife was found, say by dividing it in half. This would focus the autoencoder harder on the rarely seen cases where a knife is visible.
On the downside, I suppose this could introduce some artifacts. In that case you may think about using a pretrained encoder (like some version of ResNet) and increasing the model's capacity.
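A minimal sketch of the weighted-loss idea, assuming you can produce a per-pixel weight_mask from whatever knife detector you end up with (tf.keras-style; the names and the 10x factor are assumptions):
import tensorflow as tf

def weighted_mse(y_true, y_pred, weight_mask):
    # weight_mask has the same spatial shape as the frames: ~10.0 wherever the
    # knife detector fired, 1.0 everywhere else
    return tf.reduce_mean(weight_mask * tf.square(y_true - y_pred))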

Regarding the consistency of convolutional neural networks

I am currently building a 2-channel (also called double-channel) convolutional neural network in order to classify 2 binary images (containing binary objects) as 'similar' or 'different'.
The problem I am having is that it seems as though the network doesn't always converge to the same solution. For example, I can use exactly the same ordering of training pairs and all the same parameters and so forth, and when I run the network multiple times, each time produces a different solution; sometimes converging to below 2% error rates, and other times I get 50% error rates.
I have a feeling that it has something to do with the random initialization of the weights of the network, which results in different optimization paths each time the network is executed. This issue even occurs when I use SGD with momentum, so I don't really know how to 'force' the network to converge to the same solution (global optima) every time?
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
There are several sources of randomness in training.
Initialization is one. SGD itself is of course stochastic since the content of the minibatches is often random. Sometimes, layers like dropout are inherently random too. The only way to ensure getting identical results is to fix the random seed for all of them.
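For reference, fixing the seeds in a TensorFlow 1.x setup looks roughly like this (and even then, some GPU ops such as cuDNN convolutions remain nondeterministic):
import random
import numpy as np
import tensorflow as tf

seed = 42
random.seed(seed)         # Python's own RNG (e.g. list shuffling)
np.random.seed(seed)      # numpy, which usually drives minibatch sampling
tf.set_random_seed(seed)  # graph-level seed; tf.random.set_seed(seed) in TF 2.x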
Given all these sources of randomness and a model with many millions of parameters, your quote
"I don't really know how to 'force' the network to converge to the same solution (global optima) every time?"
is pretty much something anyone could say - no one knows how to find the same solution every time, or even the same local optimum, let alone the global optimum.
Nevertheless, ideally, it is desirable to have the network perform similarly across training attempts (with fixed hyper-parameters and dataset). Anything else is going to cause problems in reproducibility, of course.
Unfortunately, I suspect the problem is inherent to CNNs.
You may be aware of the bias-variance tradeoff. For a powerful model like a CNN, the bias is likely to be low, but the variance very high. In other words, CNNs are sensitive to data noise, initialization, and hyper-parameters. Hence, it's not so surprising that training the same model multiple times yields very different results. (I also see this phenomenon, with performance changing between training runs by as much as 30% in one project I did.) My main suggestion to reduce this is stronger regularization.
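For instance, "stronger regularization" in a Keras-style model could be as simple as the following (the values are placeholders to tune, not recommendations):
from tensorflow.keras import layers, regularizers

# L2 weight decay on a conv layer, plus dropout before the classifier head
conv = layers.Conv2D(64, 3, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-4))
drop = layers.Dropout(0.5)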
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
As I mentioned, this problem is present inherently in deep models to some extent. However, your use of binary images may also be a factor, since the space of the data itself is rather discontinuous. Perhaps consider "softening" the input (e.g. filtering the inputs) and using data augmentation. A similar idea is part of what makes label smoothing helpful, for example.

Neural Network gets stuck

I am experimenting with classification using neural networks (I am using tensorflow).
And unfortunately the training of my neural network gets stuck at 42% accuracy.
I have 4 classes, into which I try to classify the data.
And unfortunately, my data set is not well balanced, meaning that:
43% of the data belongs to class 1 (and yes, my network gets stuck predicting only this)
37% to class 2
13% to class 3
7% to class 4
The optimizer I am using is AdamOptimizer and the cost function is tf.nn.softmax_cross_entropy_with_logits.
I was wondering if the reason for my training getting stuck at 42% is really the fact that my data set is not well balanced, or because the nature of the data is really random, and there are really no patterns to be found.
Currently my NN consists of:
input layer
2 convolution layers
7 fully connected layers
output layer
I tried changing this structure of the network, but the result is always the same.
I also tried Support Vector Classification, and the result is pretty much the same, with small variations.
Did somebody else encounter similar problems?
Could anybody please provide me some hints how to get out of this issue?
Thanks,
Gerald
I will assume that you have already double, triple and quadruple checked that the data going in is matching what you expect.
The question is quite open-ended, and even a topic for research. But there are some things that can help.
In terms of better training, there are two usual ways in which people train neural networks with an unbalanced dataset.
Oversample the examples with lower frequency, such that the proportion of examples for each class that the network sees is equal. e.g. in every batch, enforce that 1/4 of the examples are from class 1, 1/4 from class 2, etc.
Weight the error for misclassifying each class by the inverse of its proportion, e.g. incorrectly classifying an example of class 1 is worth 100/43, while incorrectly classifying an example of class 4 is worth 100/7 (a sketch of this follows below).
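Since you mention tf.nn.softmax_cross_entropy_with_logits, the second option could look roughly like this (labels and logits stand for the one-hot targets and network outputs you already have):
import tensorflow as tf

# per-class weights: inverse of the class frequencies from the question
class_weights = tf.constant([100/43., 100/37., 100/13., 100/7.])

per_example_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
example_weights = tf.reduce_sum(class_weights * labels, axis=1)  # pick each example's class weight
loss = tf.reduce_mean(per_example_loss * example_weights)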
That being said, if your learning rate is good, neural networks will often eventually (after many hours of just sitting there) jump out of only predicting for one class, but they still rarely end well with a badly skewed dataset.
If you want to know whether or not there are patterns in your data which can be determined, there is a simple way to do that.
Create a new dataset by randomly selecting elements from all of your classes such that you have an equal number from each of them (i.e. if there are 700 examples of class 4, then construct a dataset by randomly selecting 700 examples from every class).
Then you can use all of your techniques on this new dataset.
Bear in mind, though, that this paper suggests that even with random labels, a network should be able to find some pattern that it can fit.
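A minimal numpy sketch of building that balanced subset (X and y are placeholders for your features and integer class labels):
import numpy as np

n_min = min(np.sum(y == c) for c in range(4))    # size of the rarest class
idx = np.concatenate([
    np.random.choice(np.where(y == c)[0], n_min, replace=False)
    for c in range(4)
])
np.random.shuffle(idx)
X_balanced, y_balanced = X[idx], y[idx]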
First, you should check whether your model is overfitting or underfitting, both of which can cause low accuracy. Compare the accuracy on the training set and the dev set: if accuracy on the training set is much higher than on the dev/test set, the model may be overfitting; if accuracy on the training set is as low as on the dev/test set, it may be underfitting.
For overfitting, more data or a simpler architecture may help, while making your structure more complex and training for longer may solve an underfitting problem.

Finetuning Caffe Deep Learning Check failed: error == cudaSuccess out of memory

I am relatively new to deep learning and its frameworks. Currently, I am experimenting with the Caffe framework and trying to fine-tune Vgg16_places_365.
I am using an Amazon EC2 g2.8xlarge instance with 4 GPUs (each with 4 GB of RAM). However, when I try to train my model (using a single GPU), I get this error:
Check failed: error == cudaSuccess (2 vs. 0) out of memory
After I did some research, I found that one of the ways to solve this out of memory problem is by reducing the batch size in my train.prototxt
Caffe | Check failed: error == cudaSuccess (2 vs. 0) out of memory.
Initially, I set the batch size to 50, and iteratively reduced it to 10 (since it worked when batch_size = 10).
Now, the model is being trained and I am pretty sure it will take quite a long time. However, as a newcomer to this domain, I am curious about the relation between this batch size and other parameters such as the learning rate, stepsize and even the max iteration that we specify in the solver.prototxt.
How significantly will the batch size affect the quality of the model (e.g. its accuracy)? How can the other parameters be used to improve that quality? Also, instead of reducing the batch size or scaling up my machine, is there another way to fix this problem?
To answer your first question regarding the relationship between parameters such as batch size, learning rate and maximum number of iterations, you are best off reading about the mathematical background. A good place to start might be this stats.stackexchange question: How large should the batch size be for stochastic gradient descent?. The answer will briefly discuss the relation between batch size and learning rate (from your question I assume learning rate = stepsize) and also provide some references for further reading.
To answer your last question, with the dataset you are finetuning on and the model (i.e. the VGG16) being fixed (i.e. the input data of fixed size, and the model of fixed size), you will have a hard time avoiding the out of memory problem for large batch sizes. However, if you are willing to reduce the input size or the model size you might be able to use larger batch sizes. Depending on how (and what) exactly you are finetuning, reducing the model size may already be achieved by discarding learned layers or reducing the number/size of fully connected layers.
The remaining questions, i.e. how significantly the batchsize influences quality/accuracy and how other parameters influence quality/accuracy, are hard to answer without knowing the concrete problem you are trying to solve. The influence of e.g. the batchsize on the achieved accuracy might depend on various factors such as the noise in your dataset, the dimensionality of your dataset, the size of your dataset as well as other parameters such as the learning rate (=stepsize) or momentum parameter. For this sort of question, I recommend the textbook by Goodfellow et al.; e.g. chapter 11 provides some general guidelines on choosing these hyperparameters (i.e. batchsize, learning rate etc.).
Another way to solve your problem is to use all the GPUs on your machine. If you have 4x4 = 16 GB of RAM across your GPUs, that would be enough. If you are running Caffe in command-line mode, just add the --gpu argument as follows (assuming you have 4 GPUs with the default indices 0,1,2,3):
build/tools/caffe train --solver=solver.prototxt --gpu=0,1,2,3
However if you are using the python interface, running with multiple GPUs is not yet supported.
I can point out some general hints to answer your question on the batchsize:
- The smaller the batchsize is, the more stochastic your learning would be --> less probability of overfitting on the training data; higher probability of not converging.
- each iteration in caffe fetches one batch of data, runs forward and ends with a backpropagation.
- Let's say your training data is 50'000 samples and your batchsize is 10; then in 1000 iterations, 10'000 of your samples have been fed to the network. In the same scenario, if your batchsize is 50, all your training data are seen by the network in 1000 iterations. This is called one epoch. You should design your batchsize and maximum iterations so that your network is trained for a certain number of epochs (see the short calculation after these hints).
- stepsize in caffe, is the number of iterations your solver will run before multiplying the learning rate with the gamma value (if you have set your training approach as "step").
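A quick way to turn that into solver settings (the 30 epochs here is just an assumed example, not a recommendation):
train_size = 50000
batch_size = 50
iters_per_epoch = train_size // batch_size   # 1000 iterations = 1 epoch
max_iter = 30 * iters_per_epoch              # e.g. set max_iter in solver.prototxt to 30000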
