Split RNN Memory Consumption Evenly Between GPUs in TensorFlow

I'm trying to figure out the most strategic way to evenly split the memory load of a seq2seq network between two GPUs.
With convolutional networks, the task is much easier. However, I'm trying to figure out how to maximize the memory usage of 2 Titan X's. The goal is to build the largest network that the combined 24GB of memory will allow.
One idea was to place each RNN layer on a separate GPU (a rough device-placement sketch follows the layout below).
GPU1 --> RNN Layer 1 & Backward Pass
GPU2 --> RNN Layer 2,3,4
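For reference, this kind of manual placement is usually expressed with tf.device. A minimal TF 1.x sketch, where the layer sizes, input shapes and scope names are purely illustrative (not from the actual network):
import tensorflow as tf

# Illustrative input: [batch, time, features]
inputs = tf.placeholder(tf.float32, [None, 50, 128])

with tf.device('/gpu:0'):
    cell1 = tf.nn.rnn_cell.LSTMCell(512)
    out1, _ = tf.nn.dynamic_rnn(cell1, inputs, dtype=tf.float32, scope='rnn_layer1')

with tf.device('/gpu:1'):
    cell2 = tf.nn.rnn_cell.LSTMCell(512)
    out2, _ = tf.nn.dynamic_rnn(cell2, out1, dtype=tf.float32, scope='rnn_layer2')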
However, the backprop computations require a significant amount of memory. Therefore, another idea is to do the entire forward pass on one GPU and the backward pass on a separate GPU.
GPU1 --> Forward Pass
GPU2 --> Backward Pass
(However, GPU2 still takes most of the memory load)
Is there any way to measure how much of the GPU memory is being used? This would allow us to figure out how to maximize each GPU before it's "filled up".
Once 2 GPUs are used, I would eventually want to use four. However, I think maximizing 2 GPUs is the first step.

Setting "colocate_gradients_with_ops" to True may work. It allows GPU memory to be allocated more evenly, because each gradient op is placed on the same device as its corresponding forward op:
optimizer = tf.train.AdamOptimizer(learning_rate)
gvs = optimizer.compute_gradients(loss, colocate_gradients_with_ops=True)
train_op = optimizer.apply_gradients(gvs, global_step=self.global_step)
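On the question of measuring how much GPU memory is being used: one rough option, outside of TensorFlow itself, is to poll nvidia-smi while the model runs. A small sketch (the query fields are standard nvidia-smi options):
import subprocess

def gpu_memory_mb():
    """Return a list of (used_MB, total_MB) tuples, one per GPU."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used,memory.total',
         '--format=csv,noheader,nounits'])
    return [tuple(int(v) for v in line.split(','))
            for line in out.decode().strip().splitlines()]

print(gpu_memory_mb())  # e.g. [(10233, 12206), (9871, 12206)]
One caveat: by default TF 1.x reserves nearly all GPU memory up front, so nvidia-smi reports the allocator's reservation rather than what the model actually uses; setting allow_growth in the session's GPU options makes the numbers reflect actual allocation.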

Related

Why is Gradient Checking Slow For Back Propagation?

I recently learned the algorithm "Gradient Checking" for making sure the derivatives of my Neural Network's Back Propagation are calculated properly.
The course I learned it from, and many other sources such as this one, claim that it is much slower than computing the derivatives via backpropagation, but I can't seem to find anywhere that explains WHY.
So, why is gradient checking slower than calculating the derivative directly?
How much slower is it?
What you are doing in back-propagation is the backwards mode of automatic/algorithmic differentiation for a function that has a very large number N of inputs and only one output. The "inputs" here are chiefly the real-number parameters of the nodes of the neural network, and possibly also the input variables of the net.
In the backwards mode you compute the derivatives with respect to all inputs in one pass through the chain of operations. This costs about 3 function evaluations, plus the organizational overhead of executing the operation chain backwards and of storing and accessing the intermediate results.
In the forward mode for the same situation, which is what you use for gradient checking, you have to compute each derivative individually, regardless of whether you push forward AD derivatives or compute divided differences. The total cost of that is about 2*N function evaluations.
And as N is large, 2*N is much larger than 3.
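To make the 2*N count concrete, here is a small NumPy sketch of gradient checking by central differences: every parameter needs two extra loss evaluations, whereas one backward pass yields all N derivatives at once. The loss function and shapes are illustrative:
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Central-difference gradient check: 2 loss evaluations per parameter."""
    grad = np.zeros_like(params)
    for i in range(params.size):                 # N iterations ...
        p_plus, p_minus = params.copy(), params.copy()
        p_plus.flat[i] += eps
        p_minus.flat[i] -= eps
        grad.flat[i] = (loss_fn(p_plus) - loss_fn(p_minus)) / (2 * eps)  # ... times 2 evaluations
    return grad

# Example: N = 10,000 parameters -> 20,000 loss evaluations,
# versus roughly the cost of 3 evaluations for one backprop pass.
params = np.random.randn(100, 100)
loss = lambda w: np.sum(w ** 2)
approx = numerical_gradient(loss, params)
exact = 2 * params                               # analytic gradient of sum(w^2)
print(np.max(np.abs(approx - exact)))            # should be tiny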

Why is the im2col + GEMM method more efficient than a direct implementation with SIMD in CNNs?

The convolutional layers are the most computationally intensive parts of convolutional neural networks (CNNs). Currently the common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. However, the im2col operation has to load and store the image data, and it also needs another memory block to hold the intermediate data.
If I needed to optimize the convolution implementation, I might choose a direct implementation with SIMD instructions. Such a method would not incur any of this memory-copy overhead.
At the end of the following article it is claimed that "the benefits from the very regular patterns of memory access outweigh the wasteful storage costs":
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
So I would like to know the reason. Do floating-point operations require more instruction cycles? Or is the input image not that large, so that it stays resident in the cache and the memory operations don't need to access DDR and therefore cost fewer cycles?
Cache-blocking a GEMM is possible so you get mostly L1 cache hits (see also What Every Programmer Should Know About Memory?).
Fitting in the large shared L3 cache on typical x86 CPUs is not sufficient to make things efficient. The per-core L2 caches are typically 256kiB, and even that's slower than the 32kiB L1d cache.
Memory latency is high compared to a CPU core clock, but memory/cache bandwidth is not terrible these days with fast DDR4 or L3 cache hits. (But as I said, for a matmul with good cache blocking / loop tiling you can reuse data while it's still hot in L1d if you only transpose parts of the input matrix on the fly. Reducing off-core bandwidth requirements is also important for efficient matmul, not just transposing one matrix so its columns are sequential in memory.)
Beyond that, sequential access to memory is essential for efficient SIMD (loading a vector of multiple contiguous elements, letting you multiply / add / whatever 4 or 8 packed float elements with one CPU instruction). Striding down columns in a row-major matrix would hurt throughput even if the matrix was small enough to fit in L1d cache (32kiB).
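For concreteness, a minimal NumPy sketch of the im2col transform for a single-channel image with stride 1 (shapes and stride handling simplified, names my own): once every patch is laid out as a contiguous row, the whole convolution reduces to one GEMM that streams over sequential memory.
import numpy as np

def im2col(img, k):
    """Unroll every k x k patch of a 2-D image into one row of a matrix."""
    h, w = img.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k), dtype=img.dtype)
    for y in range(out_h):
        for x in range(out_w):
            cols[y * out_w + x] = img[y:y + k, x:x + k].ravel()
    return cols

img = np.random.rand(8, 8).astype(np.float32)
kernel = np.random.rand(3, 3).astype(np.float32)

# Convolution (strictly, cross-correlation) expressed as a single GEMM:
out = im2col(img, 3) @ kernel.ravel()     # shape (36,)
out = out.reshape(6, 6)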

Tensorflow on GPU uses too much memory

I'm trying to train a very basic, small LSTM model in Tensorflow on a GTX 1080 (although I've tried other cards, too). Depending on some parameters (like the hidden state size), I get a ResourceExhaustedError after a pretty regular number of iterations.
The model isn't much more than an embedding matrix (ca. 5000*300) and a single-layer LSTM, with a final dense projection at every timestep. I have tried batch sizes as small as 1 and a hidden state size of just 20, but I still run out of memory on an 8G card, with a total training data size of 5M.
I can't wrap my head around why this is happening. I've obviously tried the suggestions from related problems discussed on Stack Overflow, including reducing per_process_gpu_memory_fraction in the TF GPU options, but to no avail.
See code here:
https://pastebin.com/1MSUrt15 (main training script)
https://pastebin.com/U1tYbM8A (model definition)
[This doesn't include some utility scripts, so won't run alone. I also deleted some functions for the sake of shortness. The code is designed for multi-task learning, which introduces some overhead here, but the memory problems persist in single-task setups.]
PS: one thing that I know I'm not doing 100% efficiently is storing all training data as a numpy array, then sampling from there and using TF's feed_dict to provide the data to my model. This, I believe, can slow down computation to some degree, but it shouldn't cause such severe memory issues, right?
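For reference, the per_process_gpu_memory_fraction option mentioned above (and the allow_growth alternative) are passed in through the session config in TF 1.x. A minimal sketch, with an arbitrary 40% cap:
import tensorflow as tf

# Cap TF's allocator at ~40% of the card, or let it grow on demand.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4,
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop using feed_dict, as in the pastebin scripts ...
Note that these settings only bound the allocator: they cap how much memory TF grabs, but they don't reduce what the model actually needs.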

Can CNN learn to weigh certain feature channels much, much more than others?

This is a hypothetical question.
Assumptions
- I am working on a 2-class semantic segmentation task
- My ground truths are binary masks
- Batch size is 1
- At an arbitrary point in my network there is a convolution layer called 'conv_5' which has a feature map size of 90 x 45 x 512
Let's assume I also decide that (during training) I will concatenate the ground truth mask to 'conv_5'. This will result in a new top we can call 'concat_1' which will be a 90 x 45 x 513 dimension feature map.
Assume that the rest of the network follows a normal pattern like a few more convolution layers, a fully connected, and softmax loss.
My question is, can the fully connected layers learn to weigh the first 512 feature channels very low and weigh the last feature channel (which we know is a perfect ground truth) very high?
If this is true then is it true in principle such that if the feature dimension was 1,000,000 channels and I add the last channel as the perfect ground truth it will still learn to effectively ignore all previous 1,000,000 feature channels?
My intuition is that if there is ever a VERY good feature channel passed in, then the network should be able to learn to utilize this channel far more than the others. I would also like to think that this is independent of the number of channels.
(In practice I have a scenario where I am passing in a nearly perfect ground truth as the 513th feature map, but it seems to have no impact at all. When I examine the magnitudes of the weights across all 513 feature channels, the magnitudes are roughly the same across all channels. This leads me to believe that the "nearly perfect mask" is only being utilized at about 1/513th of its potential. This is what motivated me to ask the question.)
Hypothetically, if you have a "killing feature" at your disposal, the net should learn to use it and ignore the "noise" from the rest of the features.
BTW, why are you using a fully connected layer for semantic segmentation? I'm not sure this is "a normal pattern" for semantic segmentation nets.
What may prevent the net from identifying the "killing features"?
- The layers above "conv_5" mess things up: if they reduce resolution (sampling/pooling/striding...) then information is lost and becomes difficult to utilize. Specifically, I suspect the fully connected layer might globally mess things up.
- A bug: the way you add the "killing feature" is not aligned with the image. Either the mask is added transposed, or you erroneously add the mask of one image to another (do you "shuffle" the training samples?)
An interesting experiment:
You can check whether the net at least has a locally optimal set of weights that uses the "killing features": use net surgery to manually set the weights such that "conv_5" is zero for all features except the "killing features", and such that the weights of the subsequent layers do not mess this up. Then you should see very high accuracy and low loss. Training the net from this point should yield very small (if any) gradients, and the weights should not change significantly even after many iterations.
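A minimal NumPy sketch of the kind of weights such surgery would set, assuming (hypothetically) a 1x1 convolution directly after 'concat_1' with a single output map; the layer and shapes are illustrative, not taken from the original network:
import numpy as np

# Hypothetical 1x1 conv over 'concat_1': 513 input channels, 1 output map.
# Hand-set weights that pass the ground-truth channel through and zero the rest.
weights = np.zeros((1, 1, 513, 1), dtype=np.float32)  # (kh, kw, in_ch, out_ch)
weights[0, 0, 512, 0] = 1.0                           # keep only channel 513
bias = np.zeros((1,), dtype=np.float32)

# If training from this starting point barely changes these weights and the
# loss is already near zero, the net *can* represent "use only the mask".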

Are GPUs good for case-based image filtering?

I am trying to figure out whether a certain problem is a good candidate for using CUDA to put the problem on a GPU.
I am essentially doing a box filter that changes based on some edge detection. So there are basically 8 cases that are tested for each pixel, and then the rest of the operations happen - typical mean calculations and such. Is the presence of these switch statements in my loop going to make this problem a bad candidate to go to the GPU?
I am not sure really how to avoid the switch statements, because this edge detection has to happen at every pixel. I suppose the entire image could have the edge detection part split out from the processing algorithm, and you could store a buffer corresponding to which filter to use for each pixel, but that seems like it would add a lot of pre-processing to the algorithm.
Edit: Just to give some context - this algorithm is already written, and OpenMP has been used to pretty good effect at speeding it up. However, the 8 cores on my development box pales in comparison to the 512 in the GPU.
Edge detection, mean calculations and cross-correlation can be implemented as 2D convolutions. Convolutions can be implemented on the GPU very effectively (speed-up > 10, up to 100 with respect to the CPU), especially for large kernels. So yes, it may make sense to rewrite the image filtering on the GPU.
Though I wouldn't use GPU as a development platform for such a method.
Typically, unless you are on a newer CUDA architecture, you will want to avoid branching. Because GPUs are basically SIMD machines, the pipeline is extremely vulnerable to, and suffers tremendously from, pipeline stalls due to branch divergence.
If you think there are significant benefits to be gained by using a GPU, do some preliminary benchmarks to get a rough idea.
If you want to learn a bit about how to write non-branching code, head over to http://cellperformance.beyond3d.com/ and have a look.
Further, investigating running this problem on multiple CPU cores might also be worth it, in which case you will probably want to look into either OpenCL or the Intel performance libraries (such as TBB).
Another go-to source for problems targeting the GPU, be it graphics, computational geometry or otherwise, is IDAV, the Institute for Data Analysis and Visualization: http://idav.ucdavis.edu
Branching is actually not that bad, if there is spatial coherence in the branching. In other words, if you are expecting chunks of pixels next to each other in the image to go through the same branch, the performance hit is minimized.
Using a GPU for processing can often be counter-intuitive; things that are obviously inefficient if done in normal serial code, are actually the best way to do it in parallel using the GPU.
The pseudo-code below looks inefficient (since it computes 8 filtered values for every pixel) but will run efficiently on a GPU:
# Compute the 8 possible filtered values for each pixel
for i = 1..8:
    # filter[i] is the box filter that you want to apply
    # to pixels of the i'th edge-type
    result[i] = GPU_RunBoxFilter(filter[i], Image)

# Compute the edge type of each pixel
# This is the value you would normally use to 'switch' with
edge_type = GPU_ComputeEdgeType(Image)

# Set up an empty result image
final_result = zeros(sizeof(Image))

# For each possible switch value, replace all pixels of that edge-type
# with its corresponding filtered value
for i = 1..8:
    final_result = GPU_ReplacePixelIfTrue(final_result, result[i], edge_type == i)
Hopefully that helps!
Yep, control flow usually carries a performance penalty on the GPU, be it ifs, switches or ternary operators, because with control flow operations the GPU can't run threads optimally. So the usual tactic is to avoid branching as much as possible. In some cases ifs can be replaced by a formula, where the if conditions map to formula coefficients. But the concrete solution/optimization depends on the concrete GPU kernel... Maybe you can show the exact code, to be analyzed further by the Stack Overflow community.
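As an illustration of replacing an if with a formula/mask, here is a NumPy-style sketch of the same idea as the pseudo-code above; the stand-in "filters" and array shapes are made up, and the same per-element trick applies inside a GPU kernel:
import numpy as np

pixels = np.random.rand(480, 640).astype(np.float32)
edge_type = np.random.randint(0, 8, size=pixels.shape)   # 0..7, as from edge detection

# Branchy version (per-pixel switch):  out[y, x] = filtered[edge_type[y, x]][y, x]
# Branchless version: compute all candidates, then select with arithmetic masks.
filtered = [pixels * (0.1 * (i + 1)) for i in range(8)]  # stand-in for the 8 box filters
out = np.zeros_like(pixels)
for i in range(8):
    mask = (edge_type == i).astype(np.float32)           # 1.0 where this case applies
    out += mask * filtered[i]                            # select via multiply-add, no branch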
EDIT:
Just in case you are interested, here is a convolution pixel shader that I wrote.
