Why does Tensorflow CNN use too much memory? - machine-learning

I'm new to deep learning and I'm using TensorFlow to train a CNN for image recognition. The training images are all 128 × 128 pixels with 3 channels. My network has 3 convolutional layers, 3 max-pooling layers and 1 fully connected layer. I have more than 180,000 labeled images, so I decided to train with 4000 images per batch. However, training would not even run on my laptop because it ran out of memory, so I tried a server with 64 GB of RAM and 2 × E5 CPUs. This time it works, but it uses more than 40 GB of memory.
I'm confused, because my training images are not high resolution (only 128 × 128). Why does training still use so much memory (maybe the batch size is too big)? Is this normal? If it is, how do people train neural networks on GPUs? As far as I know, a GTX 1080 Ti has 11 GB of memory, which would still not be enough to train my network.

4000 sounds like a lot in one go. Most examples I've seen train a few hundred in each batch. I imagine that all the images may be getting loaded into memory at once, hence the high memory usage.
Can you try training with smaller batches? 1000, or even 500, and see if the memory usage drops?

TensorFlow keeps the intermediate values of the computation in memory for backpropagation (for example, each layer's outputs and the gradients passed between layers), so the larger the input, the more memory it consumes. Reducing your batch size is therefore the quickest way to reduce memory usage.
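To make the effect of batch size concrete, here is a rough back-of-the-envelope sketch. The filter counts (32, 64, 128) are assumptions, since the question does not state them. It counts only float32 forward activations for three 'same'-padded conv layers, each followed by 2x2 max pooling, and ignores weights, gradients and framework overhead.

```python
def activation_bytes(batch, height=128, width=128, channels=3,
                     filters=(32, 64, 128), bytes_per_value=4):
    """Rough float32 activation memory for a stack of conv + 2x2 max-pool layers."""
    total = batch * height * width * channels * bytes_per_value  # the input batch
    h, w = height, width
    for f in filters:  # assumed filter counts, not taken from the question
        total += batch * h * w * f * bytes_per_value  # conv output ('same' padding)
        h, w = h // 2, w // 2
        total += batch * h * w * f * bytes_per_value  # pooled output
    return total


for batch in (4000, 500, 100):
    print(f"batch {batch:>4}: {activation_bytes(batch) / 1e9:.1f} GB")
```

Under these assumptions a batch of 4000 needs roughly 19 GB for forward activations alone, and the figure scales linearly with the batch size, so a batch of 500 needs one eighth of that.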

Related

Artificially increasing the size of a dataset through duplication?

I'm working on a machine learning project where I'm using a neural network to solve a binary classification problem. However, my dataset (in .csv format) is relatively small. It only has around 60 yes/no cases, and although the network was able to train, the accuracy wasn't very good. My solution was to duplicate the dataset and, on each duplication, make tiny changes to the numbers, i.e. adding ±1 to each number or multiplying it by 0.999. By doing this I grew the dataset to around 1100 new cases, and the network achieved much higher accuracy. I was wondering whether this is an actual technique used by ML researchers and, if it is, whether it has an official/academic name.
Thank You!
Yes, the process you are referring to is called data augmentation.
However, I would strongly recommend against using neural networks on datasets with only hundreds to thousands of rows. Neural networks are best suited to training on large datasets.
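For tabular data, the jittering described in the question could look roughly like the sketch below. The function name `augment`, the noise level and the example rows are all made up for illustration; this is one simple way to do it, not a standard recipe.

```python
import random


def augment(rows, copies=5, noise=0.01, seed=0):
    """Return the original rows plus `copies` jittered versions of each row.

    Each row is (features, label). Every feature is scaled by a random factor
    in [1 - noise, 1 + noise]; the label is left unchanged.
    """
    rng = random.Random(seed)
    out = list(rows)
    for features, label in rows:
        for _ in range(copies):
            jittered = [x * (1 + rng.uniform(-noise, noise)) for x in features]
            out.append((jittered, label))
    return out


data = [([5.1, 3.5, 1.4], 1), ([6.2, 2.9, 4.3], 0)]  # made-up example rows
bigger = augment(data, copies=3)
print(len(bigger))  # 2 originals + 2 * 3 jittered copies
```

If you measure accuracy on held-out data, it is probably safer to split first and augment only the training part, so that near-copies of a training row do not end up in the test set.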

Multi GPU training for Transformers with different GPUs

I want to fine tune a GPT-2 model using Huggingface’s Transformers. Preferably the medium model but large if possible. Currently, I have a RTX 2080 Ti with 11GB of memory and I can train the small model just fine.
My question is: will I run into any issues if I added an old Tesla K80 (24GB) to my machine and distributed the training? I cannot find information about using different capacity GPUs during training and issues I could run into.
Will my model size limit essentially be sum of all available GPU memory? (35GB?)
I’m not interested in doing this in AWS.
You already solved your problem. That's great. I would like to point out a different approach and address a few questions.
"Will my model size limit essentially be sum of all available GPU memory? (35GB?)"
This depends on the training technique you use. Standard data parallelism replicates the model, the gradients and the optimizer states on each GPU, so each GPU must have enough memory to hold all of them. Only the data is split across the GPUs. However, the bottleneck is usually the optimizer states and the model, not the data.
The state-of-the-art approach in training is ZeRO. Not only the dataset but also the model parameters, the gradients and the optimizer states are split across the GPUs. This allows you to train huge models without hitting OOM. The paper has a nice illustration of this: the baseline is the standard case mentioned above, and the authors then progressively partition the optimizer states, the gradients and the model parameters across the GPUs and compare the memory usage per GPU.
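The per-GPU comparison can be sketched numerically. The sketch below assumes mixed-precision Adam as in the ZeRO paper's accounting: 2 bytes of fp16 weights, 2 bytes of fp16 gradients and 12 bytes of fp32 optimizer state per parameter. It ignores activations and buffers, and the function name is made up for illustration.

```python
def zero_memory_gb(params, gpus, stage):
    """Approximate per-GPU model-state memory in GB for mixed-precision Adam.

    stage 0 is plain data parallelism; stages 1 to 3 partition more state.
    """
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        optim /= gpus    # stage 1: partition optimizer states
    if stage >= 2:
        grads /= gpus    # stage 2: also partition gradients
    if stage >= 3:
        weights /= gpus  # stage 3: also partition parameters
    return (weights + grads + optim) / 1e9


for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

For a 7.5 billion parameter model on 64 GPUs this reproduces the paper's example figures of roughly 120 GB, 31.4 GB, 16.6 GB and 1.9 GB per GPU.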
The authors of the paper created a library called DeepSpeed, and it is very easy to integrate with Huggingface. With it I was able to increase my model size from 260 million to 11 billion :)
If you want to understand in detail how it works, here is the paper:
https://arxiv.org/pdf/1910.02054.pdf
More information on integrating DeepSpeed with Huggingface can be found here:
https://huggingface.co/docs/transformers/main_classes/deepspeed
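As a starting point, a minimal ZeRO stage 2 configuration might look like the sketch below. The values are assumptions to be tuned for your own setup, not the configuration used in this answer; the "auto" entries rely on the Transformers integration filling them in from the Trainer arguments.

```python
import json

# Hypothetical minimal DeepSpeed config; adjust the stage and values as needed.
ds_config = {
    "zero_optimization": {"stage": 2},         # partition optimizer states and gradients
    "fp16": {"enabled": "auto"},               # follow the Trainer's fp16 setting
    "train_micro_batch_size_per_gpu": "auto",  # taken from the Trainer arguments
    "gradient_accumulation_steps": "auto",
}

print(json.dumps(ds_config, indent=2))
```

The dictionary (or a JSON file with the same content) can then be passed to the `deepspeed` argument of `TrainingArguments`, as described in the documentation linked above.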
PS: There is also the model parallelism technique, in which each GPU trains different layers of the model, but it has lost its popularity and is not being actively used.

Choice of infrastructure for faster deep learning model training with tensorflow?

I am a newbie in deep learning with TensorFlow. I am trying out a seq2seq model sample code.
I wanted to understand:
What are the minimum values for the number of layers, layer size and batch size that I could start with in order to test the seq2seq model with satisfactory accuracy?
Also, what is the minimum infrastructure setup, in terms of memory and CPU capability, required to train this deep learning model within a maximum of a few hours?
My experience so far: training a seq2seq model with 2 layers of size 900 and batch size 4
took around 3 days on a 4 GB RAM, 3 GHz Intel i5 single-core processor;
took around 1 day on an 8 GB RAM, 3 GHz Intel i5 single-core processor.
Which helps the most for faster training: more RAM capacity, multiple CPU cores, or a CPU + GPU combination?
Disclaimer: I'm also new, and could be wrong on a lot of this.
"I am a newbie in deep learning with tensorflow. I am trying out a seq2seq model sample code. I wanted to understand: What is the minimum values of number of layers, layer size and batch size that I could start off with to be able to test the seq2seq model with satisfactory accuracy?"
I think this will just have to come down to experimentation: find out what works for your data set. I have heard a few pieces of advice. Don't design your own architecture if you can avoid it; find someone else's that is tried and tested. Deeper networks seem to be better than wider ones if you have to choose between the two. I also think bigger batch sizes are better if you have the memory. I've also heard that you should maximize network size and then regularize so you don't overfit.
I have the impression these are big questions that no one really knows the answer to (could be very wrong about this!). We'd all love a smart way of choosing layer size / number of layers, but no one knows exactly how changing these things affects training.
"Also, the minimum infrastructure setup required in terms of memory and cpu capability to train this deep learning model within a max time of a few hours."
Depending on your model, that could be an unreasonable request. Seems like some models train for hundreds if not thousands of hours (on GPUs).
"My experience has been training a seq2seq model to build a neural network with 2 layers of size 900 and batch size 4: took around 3 days to train on a 4GB RAM, 3GHz Intel i5 single core processor; took around 1 day to train on a 8GB RAM, 3GHz Intel i5 single core processor. Which helps the most for faster training - more RAM capacity, multiple CPU cores or a CPU + GPU combination core?"
I believe a GPU will help you the most. I have seen some work that uses the CPU (asynchronous actor-critic or something similar, which did not use locking) where the CPU seemed to be the better choice, but I think a GPU will give you huge speedups.

How to train neural network on large training set and small memory

I am writing my own neural network library with backpropagation, using GPU computing.
I want to make it general, so that I don't have to check whether the training set fits in GPU memory.
How do you train a neural network when the training set is too large to fit in GPU memory?
I assume that it fits in the host's RAM.
Do I have to run a training iteration on the first piece, then deallocate it on the device, send the second piece to the device and train on that, and so on,
and then sum up the gradient results?
Won't it be too slow if I have to push all the data through the PCIe bus?
Do you have a better idea?
Use minibatch gradient descent: in a loop,
send a batch of samples to the GPU
compute error, backprop gradient
adjust parameters.
Repeat this loop several times until the network converges.
This is not exactly equivalent to the naive batch learning algorithm (batch gradient descent): in fact it usually converges faster than batch learning. It helps if you randomly shuffle the samples before each training loop. So you still have the memory transfers, but you don't need as many iterations and the algorithm will run faster.
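The loop above can be sketched on the host side as follows. This is a toy example that fits a line with plain Python rather than a neural network, and `to_device` is a hypothetical placeholder for the host-to-GPU copy of one batch, so only one batch at a time would need to live in GPU memory.

```python
import random


def to_device(batch):
    """Hypothetical placeholder for the host-to-GPU copy of one batch."""
    return batch


def train(samples, batch_size=8, epochs=200, lr=0.05, seed=0):
    """Fit y = w * x + b with minibatch gradient descent."""
    rng = random.Random(seed)
    samples = list(samples)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)  # reshuffle before each pass over the data
        for start in range(0, len(samples), batch_size):
            batch = to_device(samples[start:start + batch_size])
            # Mean-squared-error gradients computed on this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w  # adjust parameters after every batch
            b -= lr * grad_b
    return w, b


data = [(x / 10, 2 * (x / 10) + 1) for x in range(40)]  # points on y = 2x + 1
w, b = train(data)
print(round(w, 2), round(b, 2))
```

The parameters are updated after every batch instead of after a full pass, which is why the gradients do not need to be summed across pieces first.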

OpenCV Iterative random forest training

I'm using the random forest algorithm as the classifier for my thesis project. The training set consists of thousands of images, and for each image about 2000 pixels are sampled. For each pixel, I have hundreds of thousands of features. With my current hardware limitations (8 GB of RAM, possibly extendable to 16 GB) I can fit in memory the samples (i.e. features per pixel) for only one image. My question is: is it possible to call the train method multiple times, each time with a different image's samples, and have the statistical model automatically updated at each call? I'm particularly interested in the variable importance, since, after training on the full training set with the whole feature set, my idea is to reduce the number of features from hundreds of thousands to about 2000, keeping only the most important ones.
Thank you for any advice,
Daniele
I don't think the algorithm supports incremental training. You could consider reducing the size of your descriptors prior to training, using another feature reduction method. Or you could estimate the variable importance on a random subset of pixels taken from all your training images, as many as you can fit into memory.
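The random-subset idea could be sized roughly as in the sketch below. All numbers (memory budget, image count, feature count, float32 storage) are assumptions for illustration, and the helper names are made up; the actual training call is left out.

```python
import random


def pixels_per_image(memory_bytes, n_images, n_features, bytes_per_value=4):
    """How many pixel samples per image fit in the given memory budget."""
    rows_total = memory_bytes // (n_features * bytes_per_value)
    return rows_total // n_images


def sample_pixel_indices(n_pixels, k, seed=0):
    """Pick k distinct pixel indices out of n_pixels, uniformly at random."""
    return sorted(random.Random(seed).sample(range(n_pixels), k))


# Assumed numbers: 4 GB budget, 1000 images, 200,000 float32 features per
# pixel, 2000 candidate pixels per image.
k = pixels_per_image(4 * 10**9, 1000, 200_000)
indices = sample_pixel_indices(2000, k)
print(k, indices)
```

With these numbers only a handful of pixels per image fit, which suggests the subset would mainly be useful for a first variable-importance estimate before reducing the feature set.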
See my answer to this post. There are incremental versions of random forests, and they will let you train on much larger data.
