I'm training a neural network using Caffe. In the solver.prototxt file, I can set average_loss to print the loss averaged over the last N iterations. Is it possible to do the same for other values as well?
For example, I wrote a custom PythonLayer outputting accuracy, and I would like to display the average accuracy over the last N iterations as well.
Thanks,
EDIT: here is the log. The DEBUG lines show the accuracy computed for each image, and every 3 images (average_loss: 3 and display: 3) the accuracy is displayed along with the loss. Only the last of the three values is displayed; what I want is the average of the 3.
2018-04-24 10:38:06,383 [DEBUG]: Accuracy: 0 / 524288 = 0.000000
I0424 10:38:07.517436 99964 solver.cpp:251] Iteration 0, loss = 1.84883e+06
I0424 10:38:07.517503 99964 solver.cpp:267] Train net output #0: accuracy = 0
I0424 10:38:07.517521 99964 solver.cpp:267] Train net output #1: loss = 1.84883e+06 (* 1 = 1.84883e+06 loss)
I0424 10:38:07.517536 99964 sgd_solver.cpp:106] Iteration 0, lr = 2e-12
I0424 10:38:07.524904 99964 solver.cpp:287] Time: 2.44301s/1iters
2018-04-24 10:38:08,653 [DEBUG]: Accuracy: 28569 / 524288 = 0.054491
2018-04-24 10:38:11,010 [DEBUG]: Accuracy: 22219 / 524288 = 0.042379
2018-04-24 10:38:13,326 [DEBUG]: Accuracy: 168424 / 524288 = 0.321243
I0424 10:38:14.533329 99964 solver.cpp:251] Iteration 3, loss = 1.84855e+06
I0424 10:38:14.533406 99964 solver.cpp:267] Train net output #0: accuracy = 0.321243
I0424 10:38:14.533426 99964 solver.cpp:267] Train net output #1: loss = 1.84833e+06 (* 1 = 1.84833e+06 loss)
I0424 10:38:14.533440 99964 sgd_solver.cpp:106] Iteration 3, lr = 2e-12
I0424 10:38:14.534195 99964 solver.cpp:287] Time: 7.01088s/3iters
2018-04-24 10:38:15,665 [DEBUG]: Accuracy: 219089 / 524288 = 0.417879
2018-04-24 10:38:17,943 [DEBUG]: Accuracy: 202896 / 524288 = 0.386993
2018-04-24 10:38:20,210 [DEBUG]: Accuracy: 0 / 524288 = 0.000000
I0424 10:38:21.393121 99964 solver.cpp:251] Iteration 6, loss = 1.84769e+06
I0424 10:38:21.393190 99964 solver.cpp:267] Train net output #0: accuracy = 0
I0424 10:38:21.393210 99964 solver.cpp:267] Train net output #1: loss = 1.84816e+06 (* 1 = 1.84816e+06 loss)
I0424 10:38:21.393224 99964 sgd_solver.cpp:106] Iteration 6, lr = 2e-12
I0424 10:38:21.393940 99964 solver.cpp:287] Time: 6.85962s/3iters
2018-04-24 10:38:22,529 [DEBUG]: Accuracy: 161180 / 524288 = 0.307426
2018-04-24 10:38:24,801 [DEBUG]: Accuracy: 178021 / 524288 = 0.339548
2018-04-24 10:38:27,090 [DEBUG]: Accuracy: 208571 / 524288 = 0.397818
I0424 10:38:28.297776 99964 solver.cpp:251] Iteration 9, loss = 1.84482e+06
I0424 10:38:28.297843 99964 solver.cpp:267] Train net output #0: accuracy = 0.397818
I0424 10:38:28.297863 99964 solver.cpp:267] Train net output #1: loss = 1.84361e+06 (* 1 = 1.84361e+06 loss)
I0424 10:38:28.297878 99964 sgd_solver.cpp:106] Iteration 9, lr = 2e-12
I0424 10:38:28.298607 99964 solver.cpp:287] Time: 6.9049s/3iters
I0424 10:38:28.331749 99964 solver.cpp:506] Snapshotting to binary proto file snapshot/train_iter_10.caffemodel
I0424 10:38:36.171842 99964 sgd_solver.cpp:273] Snapshotting solver state to binary proto file snapshot/train_iter_10.solverstate
I0424 10:38:43.068686 99964 solver.cpp:362] Optimization Done.
Caffe averages only the net's global loss (the weighted sum of all loss layers) over average_loss iterations; for every other output blob it reports the value from the last batch only.
Therefore, if you want your Python layer to report accuracy averaged over several iterations, I suggest you store a buffer as a member of your layer class and display this aggregated value.
Alternatively, you can implement a "moving average" on top of the accuracy calculation and output this value as a "top".
You can have a "moving average output layer" implemented in Python.
This layer can take any number of "bottoms" and output the moving average of each of them.
Python code of layer:
import caffe


class MovingAverageLayer(caffe.Layer):
    def setup(self, bottom, top):
        assert len(bottom) == len(top), "layer must have same number of inputs and outputs"
        # average over how many iterations? read from param_str
        self.buf_size = int(self.param_str)
        # allocate a buffer for each "bottom"
        self.buf = [[] for _ in bottom]

    def reshape(self, bottom, top):
        # make sure inputs and outputs have the same size
        for i, b in enumerate(bottom):
            top[i].reshape(*b.shape)

    def forward(self, bottom, top):
        for i, b in enumerate(bottom):
            # put into buffers (copy, so later iterations do not overwrite stored values)
            self.buf[i].append(b.data.copy())
            # drop the oldest entry once the buffer is full
            if len(self.buf[i]) > self.buf_size:
                self.buf[i].pop(0)
            # compute average
            a = 0
            for elem in self.buf[i]:
                a += elem
            top[i].data[...] = a / len(self.buf[i])

    def backward(self, top, propagate_down, bottom):
        # this layer does not back prop
        pass
How to use this layer in prototxt:
layer {
  name: "moving_ave"
  type: "Python"
  bottom: "accuracy"
  top: "av_accuracy"
  python_param {
    layer: "MovingAverageLayer"
    module: "path.to.module"
    param_str: "30"  # buf size
  }
}
See this tutorial for more information.
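The buffering logic can be checked without Caffe. The following is only a minimal sketch in plain Python: `MovingAverage` is a hypothetical helper name, not part of Caffe, and the three values fed in are the per-image accuracies logged before iteration 3 in the question.

```python
from collections import deque


class MovingAverage(object):
    """Mean of the last `size` values (hypothetical helper, not a Caffe class)."""

    def __init__(self, size):
        # a deque with maxlen silently drops the oldest value when full
        self.buf = deque(maxlen=size)

    def update(self, value):
        self.buf.append(float(value))
        return sum(self.buf) / len(self.buf)


avg = MovingAverage(3)
# accuracies from the DEBUG lines preceding "Iteration 3" in the question's log
for acc in (0.054491, 0.042379, 0.321243):
    current = avg.update(acc)
print(round(current, 6))
```

With a window of 3 this gives roughly 0.139, the mean of the three DEBUG values, instead of the 0.321243 from the last image that the solver printed.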
Original incorrect answer:
Caffe outputs to log whatever the net outputs: loss, accuracy or any other blob that appears as "top" of a layer and is not used as a "bottom" in any other layer.
Therefore, if you want to see accuracy computed by a "Python" layer, simply make sure no other layer uses this accuracy as an input.
I have a dataframe with approximately 14560 word vectors of dimension 400. I have reshaped each vector in 20*20 and used 1 channel for applying a CNN so the dimension has become (14560,20,20,1). When I try to fit the CNN model it throws an error.
Code:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers import BatchNormalization
from keras.utils import np_utils
from keras import backend as K
model_cnn=Sequential()
model_cnn.add(Convolution2D(filters = 16, kernel_size = (3, 3),
activation='relu',input_shape = (20, 20,1)))
model_cnn.compile(loss='categorical_crossentropy', optimizer = 'adadelta',
metrics=["accuracy"])
model_cnn.fit(x_tr_,y_tr_,validation_data=(x_te_,y_te))
Error:
Error when checking target: expected conv2d_6 to have 4 dimensions, but got array with shape (14560, 1).
When I reshape the training data to (14560,1,20,20) it still gives an error, as the model receives input (1,20,20) while (20,20,1) is required.
How do I fix it?
Problem
The problem is not only with x_tr shape, which should be (-1,20,20,1) as correctly pointed out in another answer. It's also the network architecture itself. If you do model_cnn.summary(), you'll see the following:
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 18, 18, 16)        160
=================================================================
Total params: 160
Trainable params: 160
Non-trainable params: 0
The output of the model is rank 4: (batch_size, 18, 18, 16). It can't compute the loss when the labels are (batch_size, 1).
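The numbers in that summary can be reproduced by hand. This is a small sketch of the arithmetic only, assuming the default unpadded ("valid") convolution with stride 1; the function names are made up for illustration and are not Keras APIs.

```python
def conv2d_valid_output(size, kernel, stride=1):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1


def conv2d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


side = conv2d_valid_output(20, 3)
print((side, side, 16))         # per-sample output shape of the conv layer
print(conv2d_params(3, 1, 16))  # value in the "Param #" column
```

A 3x3 kernel on a 20x20 input leaves 18x18 positions, and 3*3*1*16 weights plus 16 biases give the 160 parameters shown above.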
Solution
The correct architecture must reshape the convolutional output tensor (batch_size, 18, 18, 16) to (batch_size, 1). There are many ways to do it; here's one:
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPool2D, Flatten, Dense

model_cnn = Sequential()
model_cnn.add(Convolution2D(filters=16, kernel_size=(3, 3), activation='relu', input_shape=(20, 20, 1)))
model_cnn.add(MaxPool2D(pool_size=18))
model_cnn.add(Flatten())
model_cnn.add(Dense(units=1))
model_cnn.compile(loss='sparse_categorical_crossentropy', optimizer='adadelta', metrics=["accuracy"])
The summary:
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 18, 18, 16)        160
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 1, 1, 16)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 16)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17
=================================================================
Total params: 177
Trainable params: 177
Non-trainable params: 0
Note that I added max-pooling to reduce 18x18 feature maps to 1x1, then flatten layer to squeeze the tensor to (None, 16) and finally the dense layer to output a single value. Also pay attention to the loss function: it's sparse_categorical_crossentropy. If you wish to do categorical_crossentropy, you have to do one-hot encoding and output not a single number, but the probability distribution over classes: (None, classes).
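For the categorical_crossentropy route, the labels have to go from shape (N, 1) to (N, classes). Here is a minimal NumPy sketch of that one-hot step; the labels and the class count of 3 are made up for illustration, and `one_hot` is a hypothetical helper (Keras ships its own `to_categorical` utility for the same job).

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer labels of shape (N, 1) or (N,) into an (N, num_classes) array."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    # row k of the identity matrix is the one-hot vector for class k
    return np.eye(num_classes, dtype="float32")[labels]


y = np.array([[0], [2], [1]])  # made-up labels in the same (N, 1) layout
y_one_hot = one_hot(y, num_classes=3)
print(y_one_hot.shape)
```

The final Dense layer would then need as many units as there are classes, so that the model output and the encoded labels have the same shape.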
By the way, also check that your validation arrays have valid shape.
I've been messing with Keras, and like it so far. There's one big issue I have been having, when working with fairly deep networks: When calling model.train_on_batch, or model.fit etc., Keras allocates significantly more GPU memory than what the model itself should need. This is not caused by trying to train on some really large images, it's the network model itself that seems to require a lot of GPU memory. I have created this toy example to show what I mean. Here's essentially what's going on:
I first create a fairly deep network, and use model.summary() to get the total number of parameters needed for the network (in this case 206538153, which corresponds to about 826 MB). I then use nvidia-smi to see how much GPU memory Keras has allocated, and I can see that it makes perfect sense (849 MB).
I then compile the network, and can confirm that this does not increase GPU memory usage. And as we can see in this case, I have almost 1 GB of VRAM available at this point.
Then I try to feed a simple 16x16 image and a 1x1 ground truth to the network, and then everything blows up, because Keras starts allocating lots of memory again, for no reason that is obvious to me. Something about training the network seems to require a lot more memory than just having the model, which doesn't make sense to me. I have trained significantly deeper networks on this GPU in other frameworks, so that makes me think that I'm using Keras wrong (or there's something wrong in my setup, or in Keras, but of course that's hard to know for sure).
Here's the code:
from scipy import misc
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Convolution2D, MaxPooling2D, Reshape, Flatten, ZeroPadding2D, Dropout
import os
model = Sequential()
model.add(Convolution2D(256, 3, 3, border_mode='same', input_shape=(16,16,1)))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Convolution2D(512, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Convolution2D(256, 3, 3, border_mode='same'))
model.add(Convolution2D(32, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(4))
model.add(Dense(1))
model.summary()
os.system("nvidia-smi")
raw_input("Press Enter to continue...")
model.compile(optimizer='sgd',
              loss='mse',
              metrics=['accuracy'])
os.system("nvidia-smi")
raw_input("Compiled model. Press Enter to continue...")
n_batches = 1
batch_size = 1
for ibatch in range(n_batches):
    x = np.random.rand(batch_size, 16,16,1)
    y = np.random.rand(batch_size, 1)
    os.system("nvidia-smi")
    raw_input("About to train one iteration. Press Enter to continue...")
    model.train_on_batch(x, y)
    print("Trained one iteration")
Which gives the following output for me:
Using Theano backend.
Using gpu device 0: GeForce GTX 960 (CNMeM is disabled, cuDNN 5103)
/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
warnings.warn(warn)
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
convolution2d_1 (Convolution2D) (None, 16, 16, 256) 2560 convolution2d_input_1[0][0]
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D) (None, 8, 8, 256) 0 convolution2d_1[0][0]
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D) (None, 8, 8, 512) 1180160 maxpooling2d_1[0][0]
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D) (None, 4, 4, 512) 0 convolution2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D) (None, 4, 4, 1024) 4719616 maxpooling2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_4[0][0]
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_5[0][0]
____________________________________________________________________________________________________
convolution2d_7 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_6[0][0]
____________________________________________________________________________________________________
convolution2d_8 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_7[0][0]
____________________________________________________________________________________________________
convolution2d_9 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_8[0][0]
____________________________________________________________________________________________________
convolution2d_10 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_9[0][0]
____________________________________________________________________________________________________
convolution2d_11 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_10[0][0]
____________________________________________________________________________________________________
convolution2d_12 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_11[0][0]
____________________________________________________________________________________________________
convolution2d_13 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_12[0][0]
____________________________________________________________________________________________________
convolution2d_14 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_13[0][0]
____________________________________________________________________________________________________
convolution2d_15 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_14[0][0]
____________________________________________________________________________________________________
convolution2d_16 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_15[0][0]
____________________________________________________________________________________________________
convolution2d_17 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_16[0][0]
____________________________________________________________________________________________________
convolution2d_18 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_17[0][0]
____________________________________________________________________________________________________
convolution2d_19 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_18[0][0]
____________________________________________________________________________________________________
convolution2d_20 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_19[0][0]
____________________________________________________________________________________________________
convolution2d_21 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_20[0][0]
____________________________________________________________________________________________________
convolution2d_22 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_21[0][0]
____________________________________________________________________________________________________
convolution2d_23 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_22[0][0]
____________________________________________________________________________________________________
convolution2d_24 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_23[0][0]
____________________________________________________________________________________________________
maxpooling2d_3 (MaxPooling2D) (None, 2, 2, 1024) 0 convolution2d_24[0][0]
____________________________________________________________________________________________________
convolution2d_25 (Convolution2D) (None, 2, 2, 256) 2359552 maxpooling2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_26 (Convolution2D) (None, 2, 2, 32) 73760 convolution2d_25[0][0]
____________________________________________________________________________________________________
maxpooling2d_4 (MaxPooling2D) (None, 1, 1, 32) 0 convolution2d_26[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten) (None, 32) 0 maxpooling2d_4[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (None, 4) 132 flatten_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 5 dense_1[0][0]
====================================================================================================
Total params: 206538153
____________________________________________________________________________________________________
None
Thu Oct 6 09:05:42 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 0000:01:00.0 On | N/A |
| 30% 37C P2 28W / 120W | 1082MiB / 2044MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1796 G /usr/bin/X 155MiB |
| 0 2597 G compiz 65MiB |
| 0 5966 C python 849MiB |
+-----------------------------------------------------------------------------+
Press Enter to continue...
Thu Oct 6 09:05:44 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 0000:01:00.0 On | N/A |
| 30% 38C P2 28W / 120W | 1082MiB / 2044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1796 G /usr/bin/X 155MiB |
| 0 2597 G compiz 65MiB |
| 0 5966 C python 849MiB |
+-----------------------------------------------------------------------------+
Compiled model. Press Enter to continue...
Thu Oct 6 09:05:44 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 0000:01:00.0 On | N/A |
| 30% 38C P2 28W / 120W | 1082MiB / 2044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1796 G /usr/bin/X 155MiB |
| 0 2597 G compiz 65MiB |
| 0 5966 C python 849MiB |
+-----------------------------------------------------------------------------+
About to train one iteration. Press Enter to continue...
Error allocating 37748736 bytes of device memory (out of memory). Driver report 34205696 bytes free and 2144010240 bytes total
Traceback (most recent call last):
File "memtest.py", line 65, in <module>
model.train_on_batch(x, y)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 712, in train_on_batch
class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1221, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py", line 717, in __call__
return self.function(*inputs)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 871, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 859, in __call__
outputs = self.fn()
MemoryError: Error allocating 37748736 bytes of device memory (out of memory).
Apply node that caused the error: GpuContiguous(GpuDimShuffle{3,2,0,1}.0)
Toposort index: 338
Inputs types: [CudaNdarrayType(float32, 4D)]
Inputs shapes: [(1024, 1024, 3, 3)]
Inputs strides: [(1, 1024, 3145728, 1048576)]
Inputs values: ['not shown']
Outputs clients: [[GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0}), GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
A few things to note:
I have tried both Theano and TensorFlow backends. Both have the same problems, and run out of memory at the same line. In TensorFlow, it seems that Keras preallocates a lot of memory (about 1.5 GB) so nvidia-smi doesn't help us track what's going on there, but I get the same out-of-memory exceptions. Again, this points towards an error in (my usage of) Keras (although it's hard to be certain about such things, it could be something with my setup).
I tried using CNMEM in Theano, which behaves like TensorFlow: It preallocates a large amount of memory (about 1.5 GB) yet crashes in the same place.
There are some warnings about the cuDNN version. I tried running the Theano backend with CUDA but without cuDNN and got the same errors, so that is not the source of the problem.
If you want to test this on your own GPU, you might want to make the network deeper/shallower depending on how much GPU memory you have to test this.
My configuration is as follows: Ubuntu 14.04, GeForce GTX 960, CUDA 7.5.18, CudNN 5.1.3, Python 2.7, Keras 1.1.0 (installed via pip)
I've tried changing the compilation of the model to use different optimizers and losses, but that doesn't seem to change anything.
I've tried changing the train_on_batch function to use fit instead, but it has the same problem.
I saw one similar question here on StackOverflow - Why does this Keras model require over 6GB of memory? - but as far as I can tell, I don't have those issues in my configuration. I've never had multiple versions of CUDA installed, and I've double checked my PATH, LD_LIBRARY_PATH and CUDA_ROOT variables more times than I can count.
Julius suggested that the activation parameters themselves take up GPU memory. If this is true, can somebody explain it a bit more clearly? I have tried changing the activation function of my convolution layers to functions that are clearly hard-coded with no learnable parameters as far as I can tell, and that doesn't change anything. Also, it seems unlikely that these parameters would take up almost as much memory as the rest of the network itself.
After thorough testing, the largest network I can train is about 453 MB of parameters, out of my ~2 GB of GPU RAM. Is this normal?
After testing Keras on some smaller CNNs that do fit in my GPU, I can see that there are very sudden spikes in GPU RAM usage. If I run a network with about 100 MB of parameters, 99% of the time during training it'll be using less than 200 MB of GPU RAM. But every once in a while, memory usage spikes to about 1.3 GB. It seems safe to assume that it's these spikes that are causing my problems. I've never seen these spikes in other frameworks, but they might be there for a good reason? If anybody knows what causes them, and if there's a way to avoid them, please chime in!
It is a very common mistake to forget that the activations, gradients and optimizer moment-tracking variables also take VRAM, not just the parameters, which increases memory usage quite a bit. The backprop calculations themselves make the training phase take almost double the VRAM of forward/inference use of the neural net, and the Adam optimizer triples the space usage.
So, in the beginning, when the network is created, only the parameters are allocated. However, when training starts, the model activations, backprop computations and the optimizer's tracking variables get allocated, increasing memory use by a large factor.
To allow the training of larger models, people:
use model parallelism to spread the weights and computations over different accelerators;
use gradient checkpointing, which trades more computation for lower memory use during back-propagation;
potentially use a memory-efficient optimizer that aims to reduce the number of tracking variables, such as Adafactor, for which implementations exist for all popular deep learning frameworks.
Tools to train very large models:
Mesh-TensorFlow: https://arxiv.org/abs/1811.02084 https://github.com/tensorflow/mesh
Microsoft DeepSpeed: https://github.com/microsoft/DeepSpeed https://www.deepspeed.ai/
Facebook FairScale: https://github.com/facebookresearch/fairscale
Megatron-LM: https://arxiv.org/abs/1909.08053 https://github.com/NVIDIA/Megatron-LM
Article on integration in HuggingFace Transformers: https://huggingface.co/blog/zero-deepspeed-fairscale
Both Theano and TensorFlow augment the symbolic graph that is created, though each does so differently.
To analyze how the memory consumption is happening you can start with a smaller model and grow it to see the corresponding growth in memory. Similarly you can grow the batch_size to see the corresponding growth in memory.
Here is a code snippet for increasing batch_size based on your initial code:
from scipy import misc
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Convolution2D, MaxPooling2D, Reshape, Flatten, ZeroPadding2D, Dropout
import os
import matplotlib.pyplot as plt
def gpu_memory():
    out = os.popen("nvidia-smi").read()
    ret = '0MiB'
    for item in out.split("\n"):
        if str(os.getpid()) in item and 'python' in item:
            ret = item.strip().split(' ')[-2]
    return float(ret[:-3])
gpu_mem = []
gpu_mem.append(gpu_memory())
model = Sequential()
model.add(Convolution2D(100, 3, 3, border_mode='same', input_shape=(16,16,1)))
model.add(Convolution2D(256, 3, 3, border_mode='same'))
model.add(Convolution2D(32, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(4))
model.add(Dense(1))
model.summary()
gpu_mem.append(gpu_memory())
model.compile(optimizer='sgd',
loss='mse',
metrics=['accuracy'])
gpu_mem.append(gpu_memory())
batches = []
n_batches = 20
batch_size = 1
for ibatch in range(n_batches):
    batch_size = (ibatch+1)*10
    batches.append(batch_size)
    x = np.random.rand(batch_size, 16,16,1)
    y = np.random.rand(batch_size, 1)
    print(y.shape)
    model.train_on_batch(x, y)
    print("Trained one iteration")
    gpu_mem.append(gpu_memory())
fig = plt.figure()
plt.plot([-100, -50, 0]+batches, gpu_mem)
plt.show()
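The gpu_memory() helper above matches the PID as a substring and relies on the exact column spacing of nvidia-smi. A slightly more defensive parse could look like the sketch below. It is an illustration only: `parse_process_memory` is a hypothetical name, and the sample text is copied from the process table in the question, so other driver versions may format the table differently.

```python
def parse_process_memory(smi_output, pid):
    """Return the MiB used by `pid` in nvidia-smi text output, or 0.0 if not listed."""
    for line in smi_output.splitlines():
        # drop the table borders, then split on any run of whitespace
        fields = line.strip("| \t").split()
        if str(pid) in fields:  # whole-token match, not a substring match
            for field in reversed(fields):
                if field.endswith("MiB"):
                    return float(field[:-3])
    return 0.0


sample = """
|    0      1796    G   /usr/bin/X                                     155MiB |
|    0      2597    G   compiz                                          65MiB |
|    0      5966    C   python                                         849MiB |
"""
print(parse_process_memory(sample, 5966))
```

Inside gpu_memory() this would be called with the output of nvidia-smi and os.getpid().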
Also, for speed, TensorFlow grabs all available GPU memory. To stop that, you need to add config.gpu_options.allow_growth = True in get_session():
# keras/backend/tensorflow_backend.py
def get_session():
    global _SESSION
    if tf.get_default_session() is not None:
        session = tf.get_default_session()
    else:
        if _SESSION is None:
            if not os.environ.get('OMP_NUM_THREADS'):
                config = tf.ConfigProto(allow_soft_placement=True)
            else:
                nb_thread = int(os.environ.get('OMP_NUM_THREADS'))
                config = tf.ConfigProto(intra_op_parallelism_threads=nb_thread,
                                        allow_soft_placement=True)
            # added line: allocate GPU memory on demand instead of all at once
            config.gpu_options.allow_growth = True
            _SESSION = tf.Session(config=config)
        session = _SESSION
    if not _MANUAL_VAR_INIT:
        _initialize_variables()
    return session
Now if you run the previous snippet, you get plots of GPU memory usage against batch size like these:
[Plot: Theano backend]
[Plot: TensorFlow backend]
Theano: After model.compile(), whatever memory is in use almost doubles at the start of training. This is because Theano augments the symbolic graph to do back-propagation, and each tensor needs a corresponding tensor to carry the backward flow of gradients. The memory needs don't seem to grow with batch_size, which is unexpected to me, as the placeholder size should increase to accommodate the data inflow from CPU to GPU.
TensorFlow: No GPU memory is allocated even after model.compile(), because Keras doesn't call get_session() before that point, and get_session() is what actually calls _initialize_variables(). TensorFlow seems to grab memory in chunks for speed, so the memory doesn't grow linearly with batch_size.
Having said all that, TensorFlow seems to be memory hungry, but for big graphs it's very fast. Theano, on the other hand, is very GPU-memory efficient but takes a long time to initialize the graph at the start of training. After that it's also pretty fast.
200M parameters is far too much for a 2 GB GPU. Your architecture is also not efficient; using local bottlenecks would be more efficient.
You should also go from a small model to a big one, not the other way round. Right now your input is 16x16, and with this architecture that means that towards the end most of your network will be working on zero padding rather than on input features.
Your model's layers depend on your input, so you can't just set an arbitrary number of layers and sizes. You need to work out how much data will be passed to each of them and understand why you are doing so.
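One way to put a number on the zero-padding point is to count how many kernel taps of a 3x3 'same' convolution land on padding rather than on real feature-map values. This is only an illustrative sketch with a made-up helper name; it counts taps for a single layer and does not model how the effect compounds across stacked layers.

```python
import numpy as np


def padding_tap_fraction(size, kernel=3):
    """Fraction of kernel taps that read zero padding in a 'same' convolution."""
    pad = kernel // 2
    real = np.zeros((size + 2 * pad, size + 2 * pad))
    real[pad:pad + size, pad:pad + size] = 1.0  # 1 marks real input, 0 marks padding
    taps_on_real = 0.0
    for row in range(size):
        for col in range(size):
            # window read by the kernel for output position (row, col)
            taps_on_real += real[row:row + kernel, col:col + kernel].sum()
    total_taps = size * size * kernel * kernel
    return 1.0 - taps_on_real / total_taps


# feature-map sizes that occur in the question's model after each pooling step
for size in (16, 8, 4, 2):
    print(size, round(padding_tap_fraction(size), 3))
```

For the 4x4 feature maps, where most of the question's convolution layers operate, about 31% of the taps read padding, and at 2x2 it is more than half.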
I would recommend watching this free course: http://cs231n.github.io