How to compute the number of parameters in the second convolutional layer? - machine-learning

I built a CNN model for MNIST in Keras. The code and its printed summary are shown below.
Code for the CNN:
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(63, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, name='dense', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
model summary:
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
conv2d_2 (Conv2D) (None, 24, 24, 63) 18207
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 63) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 12, 12, 63) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 9072) 0
_________________________________________________________________
dense (Dense) (None, 128) 1161344
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 1290
=================================================================
Total params: 1,181,161
Trainable params: 1,181,161
Non-trainable params: 0
_________________________________________________________________
The kernel_size of both the first and the second Conv2D layers is (3,3).
I don't understand why there are 18207 parameters in the second Conv2D layer. Shouldn't it be computed as (3*3+1)*63 = 630?

To get the number of parameters, apply the following formula:
(F x F x Ci + 1) x Co
where F x F is the kernel size, Ci is the number of input channels, Co is the number of output channels, and the +1 is the bias of each output channel.
So in your case you are just forgetting the input channels:
18207 = 63 * (3*3*32 + 1)
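As a quick check, the formula above can be evaluated for every layer in the question's summary. This is only a minimal sketch: the helper names conv2d_params and dense_params are made up for illustration and are not part of Keras.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, use_bias=True):
    """(kernel_h * kernel_w * in_channels + 1 bias) weights for each output channel."""
    per_filter = kernel_h * kernel_w * in_channels + (1 if use_bias else 0)
    return per_filter * out_channels


def dense_params(in_features, out_features, use_bias=True):
    """One weight per input feature, plus one bias, for every output unit."""
    return (in_features + (1 if use_bias else 0)) * out_features


conv1 = conv2d_params(3, 3, 1, 32)        # MNIST images have 1 channel
conv2 = conv2d_params(3, 3, 32, 63)       # input is the 32-channel output of conv1
dense1 = dense_params(12 * 12 * 63, 128)  # Flatten turns (12, 12, 63) into 9072 features
dense2 = dense_params(128, 10)

print(conv1, conv2, dense1, dense2)       # 320 18207 1161344 1290
print(conv1 + conv2 + dense1 + dense2)    # 1181161
```

The four values match the Param # column, and their sum matches "Total params".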
Edit to answer comments:
When you take the output of the first layer, you obtain an "image" of shape (None, 26, 26, 32), where None stands for the batch size.
Intuitively, you need to learn a kernel for every input channel and then map the result to the output channels. The output shape depends on the kernel size but also on the number of kernels:
Convolutions are usually computed for each channel and then summed. For example, take a (28,28,3) picture and a convolution made of three (5,5) kernels, one per input channel, i.e. a single filter of shape (5,5,3). The output is a (24,24) picture (1 output channel): you apply one kernel to each channel and sum the results.
But you can also have multiple convolutions:
You still have the same (28,28,3) picture, but now the convolutional layer has size (5,5,3,4), meaning that you have 4 of the convolutions described above. To get an output of size (24,24,4), you don't sum these 4 convolutions, you stack them to get a picture with multiple channels. You learn multiple independent convolutions at the same time.
So you see where the calculation comes from, and why the input channels matter just as much as the output ones, even though they represent very different things. (see this for a more detailed and visual explanation)
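The shapes used in this explanation can be reproduced with a few lines of arithmetic. The sketch below assumes stride 1 and no padding, which is consistent with the 28 -> 24 example; the function name valid_conv_output_shape is hypothetical.

```python
def valid_conv_output_shape(input_shape, kernel_shape):
    """Output shape of a stride-1 convolution without padding.

    input_shape:  (height, width, in_channels)
    kernel_shape: (k_h, k_w, in_channels, out_channels)
    """
    height, width, in_channels = input_shape
    k_h, k_w, k_in, k_out = kernel_shape
    if k_in != in_channels:
        raise ValueError("kernel depth must equal the number of input channels")
    # Each spatial dimension shrinks by (kernel size - 1); the per-channel results
    # are summed, so the output depth is the number of stacked filters.
    return (height - k_h + 1, width - k_w + 1, k_out)


print(valid_conv_output_shape((28, 28, 3), (5, 5, 3, 1)))     # (24, 24, 1)
print(valid_conv_output_shape((28, 28, 3), (5, 5, 3, 4)))     # (24, 24, 4)
# Second layer of the MNIST model from the question
print(valid_conv_output_shape((26, 26, 32), (3, 3, 32, 63)))  # (24, 24, 63)
```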

Related

Keras model not predicting values in the Test set

I'm building a Keras model to predict whether a user will select a certain product or not (binary classification).
The model seems to be making progress on the validation set that is held out during training, but its predictions are all 0s when it comes to the test set.
My dataset looks something like this:
train_dataset
customer_id id target customer_num_id
0 TCHWPBT 4 0 1
1 TCHWPBT 13 0 1
2 TCHWPBT 20 0 1
3 TCHWPBT 23 0 1
4 TCHWPBT 28 0 1
... ... ... ... ...
1631695 D4Q7TMM 849 0 7417
1631696 D4Q7TMM 855 0 7417
1631697 D4Q7TMM 856 0 7417
1631698 D4Q7TMM 858 0 7417
1631699 D4Q7TMM 907 0 7417
I split it into Train/Val sets using:
from sklearn.model_selection import train_test_split
Train, Val = train_test_split(train_dataset, test_size=0.1, random_state=42, shuffle=False)
After I split the dataset, I select the features that are used when training and validating the model:
train_customer_id = Train['customer_num_id']
train_vendor_id = Train['id']
train_target = Train['target']
val_customer_id = Val['customer_num_id']
val_vendor_id = Val['id']
val_target = Val['target']
... And run the model:
epochs = 2
for e in range(epochs):
    print('EPOCH: ', e)
    model.fit([train_customer_id, train_vendor_id], train_target, epochs=1, verbose=1, batch_size=384)
    prediction = model.predict(x=[train_customer_id, train_vendor_id], verbose=1, batch_size=384)
    train_f1 = f1_score(y_true=train_target.astype('float32'), y_pred=prediction.round())
    print('TRAIN F1: ', train_f1)
    val_prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
    val_f1 = f1_score(y_true=val_target.astype('float32'), y_pred=val_prediction.round())
    print('VAL F1: ', val_f1)
EPOCH: 0
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0891
TRAIN F1: 0.1537511577647422
VAL F1: 0.09745762711864409
EPOCH: 1
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0691
TRAIN F1: 0.308748569645272
VAL F1: 0.2076433121019108
The validation F1 score seems to be improving over time, and the model predicts both 1s and 0s:
prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0., 1.], dtype=float32)
But when I try to predict on the test set, the model predicts 0 for all values:
prediction = model.predict(x=[test_dataset['customer_num_id'], test_dataset['id']], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0.], dtype=float32)
The test dataset looks similar to the training and validation sets, and it was left out during training just like the validation set, yet the model can't output values other than 0.
Here's what the test dataset looks like:
test_dataset
customer_id id customer_num_id
0 Z59FTQD 243 7418
1 0JP29SK 243 7419
... ... ... ...
1671995 L9G4OFV 907 17414
1671996 L9G4OFV 907 17414
1671997 FDZFYBA 907 17415
Does anyone know what might be the issue here?
EDIT: made dataset text more readable
Please take a look at the distribution of your data. In the sample data you've shown, target is all 0s. If most users don't select the product, a model that always predicts 0 will be right most of the time, so it could be improving its accuracy by over-fitting to the majority class (0).
You can reduce over-fitting by tuning hyperparameters such as the learning rate, or by changing the model architecture, for example by adding dropout layers.
Also, I'm not sure what your model looks like, but you're only training for 2 epochs, so it may not have had enough time to learn to generalize, and depending on how deep your model is it could need a lot more training time.
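To make the majority-class point concrete, here is a small sketch with made-up labels (95 negatives and 5 positives, not the asker's data). It shows that a model predicting only 0 gets high accuracy but an F1 score of 0 for the positive class. On the real data, Train['target'].value_counts() in pandas would give the corresponding class counts.

```python
from collections import Counter


def f1_for_positive_class(y_true, y_pred):
    """F1 score of class 1, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical, heavily imbalanced labels
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts the majority class

counts = Counter(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_for_positive_class(y_true, y_pred)

print(counts)    # Counter({0: 95, 1: 5})
print(accuracy)  # 0.95
print(f1)        # 0.0
```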

Caffe - Average accuracy over the last N iterations

I'm training a neural network using Caffe. In the solver.prototxt file, I can set average_loss to print the loss averaged over the last N iterations. Is it possible to do the same for other values as well?
For example, I wrote a custom PythonLayer outputting accuracy, and I would like to display the average accuracy over the last N iterations as well.
Thanks,
EDIT: here is the log. The DEBUG lines show the accuracy computed for each image, and every 3 images (average_loss: 3 and display: 3) the accuracy is displayed along with the loss. Only the last value is displayed, whereas what I want is the average of the 3.
2018-04-24 10:38:06,383 [DEBUG]: Accuracy: 0 / 524288 = 0.000000
I0424 10:38:07.517436 99964 solver.cpp:251] Iteration 0, loss = 1.84883e+06
I0424 10:38:07.517503 99964 solver.cpp:267] Train net output #0: accuracy = 0
I0424 10:38:07.517521 99964 solver.cpp:267] Train net output #1: loss = 1.84883e+06 (* 1 = 1.84883e+06 loss)
I0424 10:38:07.517536 99964 sgd_solver.cpp:106] Iteration 0, lr = 2e-12
I0424 10:38:07.524904 99964 solver.cpp:287] Time: 2.44301s/1iters
2018-04-24 10:38:08,653 [DEBUG]: Accuracy: 28569 / 524288 = 0.054491
2018-04-24 10:38:11,010 [DEBUG]: Accuracy: 22219 / 524288 = 0.042379
2018-04-24 10:38:13,326 [DEBUG]: Accuracy: 168424 / 524288 = 0.321243
I0424 10:38:14.533329 99964 solver.cpp:251] Iteration 3, loss = 1.84855e+06
I0424 10:38:14.533406 99964 solver.cpp:267] Train net output #0: accuracy = 0.321243
I0424 10:38:14.533426 99964 solver.cpp:267] Train net output #1: loss = 1.84833e+06 (* 1 = 1.84833e+06 loss)
I0424 10:38:14.533440 99964 sgd_solver.cpp:106] Iteration 3, lr = 2e-12
I0424 10:38:14.534195 99964 solver.cpp:287] Time: 7.01088s/3iters
2018-04-24 10:38:15,665 [DEBUG]: Accuracy: 219089 / 524288 = 0.417879
2018-04-24 10:38:17,943 [DEBUG]: Accuracy: 202896 / 524288 = 0.386993
2018-04-24 10:38:20,210 [DEBUG]: Accuracy: 0 / 524288 = 0.000000
I0424 10:38:21.393121 99964 solver.cpp:251] Iteration 6, loss = 1.84769e+06
I0424 10:38:21.393190 99964 solver.cpp:267] Train net output #0: accuracy = 0
I0424 10:38:21.393210 99964 solver.cpp:267] Train net output #1: loss = 1.84816e+06 (* 1 = 1.84816e+06 loss)
I0424 10:38:21.393224 99964 sgd_solver.cpp:106] Iteration 6, lr = 2e-12
I0424 10:38:21.393940 99964 solver.cpp:287] Time: 6.85962s/3iters
2018-04-24 10:38:22,529 [DEBUG]: Accuracy: 161180 / 524288 = 0.307426
2018-04-24 10:38:24,801 [DEBUG]: Accuracy: 178021 / 524288 = 0.339548
2018-04-24 10:38:27,090 [DEBUG]: Accuracy: 208571 / 524288 = 0.397818
I0424 10:38:28.297776 99964 solver.cpp:251] Iteration 9, loss = 1.84482e+06
I0424 10:38:28.297843 99964 solver.cpp:267] Train net output #0: accuracy = 0.397818
I0424 10:38:28.297863 99964 solver.cpp:267] Train net output #1: loss = 1.84361e+06 (* 1 = 1.84361e+06 loss)
I0424 10:38:28.297878 99964 sgd_solver.cpp:106] Iteration 9, lr = 2e-12
I0424 10:38:28.298607 99964 solver.cpp:287] Time: 6.9049s/3iters
I0424 10:38:28.331749 99964 solver.cpp:506] Snapshotting to binary proto file snapshot/train_iter_10.caffemodel
I0424 10:38:36.171842 99964 sgd_solver.cpp:273] Snapshotting solver state to binary proto file snapshot/train_iter_10.solverstate
I0424 10:38:43.068686 99964 solver.cpp:362] Optimization Done.
Caffe averages only the global loss of the net (the weighted sum of all loss layers) over average_loss iterations; for all other output blobs it reports the output of the last batch only.
Therefore, if you want your Python layer to report accuracy averaged over several iterations, I suggest you store a buffer as a member of your layer class and display this aggregated value.
Alternatively, you can implement a "moving average" on top of the accuracy calculation and output this value as a "top".
You can have a "moving average output layer" implemented in python.
This layer can take any number of "bottoms" and output the moving average of these bottoms.
Python code of layer:
import caffe

class MovingAverageLayer(caffe.Layer):
    def setup(self, bottom, top):
        assert len(bottom) == len(top), "layer must have same number of inputs and outputs"
        # average over how many iterations? read from param_str
        self.buf_size = int(self.param_str)
        # allocate a buffer for each "bottom"
        # (iterate over the `bottom` argument rather than `self.bottom`)
        self.buf = [[] for _ in bottom]

    def reshape(self, bottom, top):
        # make sure inputs and outputs have the same size
        for i, b in enumerate(bottom):
            top[i].reshape(*b.shape)

    def forward(self, bottom, top):
        for i, b in enumerate(bottom):
            # put into buffers, dropping the oldest entry once the buffer is full
            self.buf[i].append(b.data.copy())
            if len(self.buf[i]) > self.buf_size:
                self.buf[i].pop(0)
            # compute average
            a = 0
            for elem in self.buf[i]:
                a += elem
            top[i].data[...] = a / len(self.buf[i])

    def backward(self, top, propagate_down, bottom):
        # this layer does not back prop
        pass
How to use this layer in prototxt:
layer {
  name: "moving_ave"
  type: "Python"
  bottom: "accuracy"
  top: "av_accuracy"
  python_param {
    layer: "MovingAverageLayer"
    module: "path.to.module"
    param_str: "30" # buf size
  }
}
See this tutorial for more information.
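The buffering logic of the layer can be tried outside Caffe. The sketch below is a plain-Python stand-in (the MovingAverage class is hypothetical and uses collections.deque instead of a list), fed with the three per-image accuracies that the question's log prints before the "Iteration 3" line.

```python
from collections import deque


class MovingAverage:
    """Keeps the last `size` values and returns their mean on every update."""

    def __init__(self, size):
        # deque(maxlen=...) drops the oldest value automatically when full
        self.values = deque(maxlen=size)

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


tracker = MovingAverage(size=3)
for accuracy in (0.054491, 0.042379, 0.321243):
    average = tracker.update(accuracy)

print(round(average, 6))  # 0.139371
```

With the same buffer size, the layer's "av_accuracy" top would carry this averaged value rather than 0.321243, the accuracy of the last image only.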
Original incorrect answer:
Caffe outputs to log whatever the net outputs: loss, accuracy or any other blob that appears as "top" of a layer and is not used as a "bottom" in any other layer.
Therefore, if you want to see accuracy computed by a "Python" layer, simply make sure no other layer uses this accuracy as an input.

CNN on word vectors throws input dimension error

I have a dataframe with approximately 14560 word vectors of dimension 400. I reshaped each vector to 20*20 and used 1 channel so that I can apply a CNN, so the data now has shape (14560,20,20,1). When I try to fit the CNN model, it throws an error.
Code:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers import BatchNormalization
from keras.utils import np_utils
from keras import backend as K
model_cnn=Sequential()
model_cnn.add(Convolution2D(filters = 16, kernel_size = (3, 3),
                            activation='relu',input_shape = (20, 20,1)))
model_cnn.compile(loss='categorical_crossentropy', optimizer = 'adadelta',
                  metrics=["accuracy"])
model_cnn.fit(x_tr_,y_tr_,validation_data=(x_te_,y_te))
Error:
Error when checking target: expected conv2d_6 to have 4 dimensions, but got array with shape (14560, 1).
When I reshape the training data to (14560,1,20,20) it still gives an error, because the model receives input of shape (1,20,20) while (20,20,1) is required.
How do I fix it?
Problem
The problem is not only the shape of x_tr, which should be (-1,20,20,1) as correctly pointed out in another answer. It's also the network architecture itself. If you call model_cnn.summary(), you'll see the following:
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 18, 18, 16) 160
=================================================================
Total params: 160
Trainable params: 160
Non-trainable params: 0
The output of the model is rank 4: (batch_size, 18, 18, 16). It can't compute the loss when the labels are (batch_size, 1).
Solution
The correct architecture must reshape the convolutional output tensor (batch_size, 18, 18, 16) to (batch_size, 1). There can be many ways to do it, here's one:
from keras.layers import MaxPool2D  # not imported in the question's code
model_cnn = Sequential()
model_cnn.add(Convolution2D(filters=16, kernel_size=(3, 3), activation='relu', input_shape=(20, 20, 1)))
model_cnn.add(MaxPool2D(pool_size=18))
model_cnn.add(Flatten())
model_cnn.add(Dense(units=1))
model_cnn.compile(loss='sparse_categorical_crossentropy', optimizer='adadelta', metrics=["accuracy"])
The summary:
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 18, 18, 16) 160
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 1, 1, 16) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 16) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 177
Trainable params: 177
Non-trainable params: 0
Note that I added max-pooling to reduce the 18x18 feature maps to 1x1, then a flatten layer to squeeze the tensor to (None, 16), and finally a dense layer to output a single value. Also pay attention to the loss function: it's sparse_categorical_crossentropy. If you wish to use categorical_crossentropy, you have to one-hot encode the labels and output not a single number but a probability distribution over classes: (None, classes).
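For reference, one-hot encoding turns each integer label into a row of length classes with a single 1. The sketch below is a plain-Python illustration with made-up labels; the one_hot helper is hypothetical, and in practice Keras' to_categorical utility does the same job.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label's index."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded


labels = [2, 0, 1]  # hypothetical integer labels, shape (3,)
targets = one_hot(labels, num_classes=3)

print(targets)  # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(len(targets), len(targets[0]))  # 3 3, i.e. shape (batch, classes)
```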
By the way, also check that your validation arrays have valid shape.

Keras uses way too much GPU memory when calling train_on_batch, fit, etc

I've been messing with Keras, and like it so far. There's one big issue I have been having, when working with fairly deep networks: When calling model.train_on_batch, or model.fit etc., Keras allocates significantly more GPU memory than what the model itself should need. This is not caused by trying to train on some really large images, it's the network model itself that seems to require a lot of GPU memory. I have created this toy example to show what I mean. Here's essentially what's going on:
I first create a fairly deep network, and use model.summary() to get the total number of parameters needed for the network (in this case 206538153, which corresponds to about 826 MB). I then use nvidia-smi to see how much GPU memory Keras has allocated, and I can see that it makes perfect sense (849 MB).
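The 826 MB figure follows directly from the parameter count. The sketch below assumes every parameter is stored as a 4-byte float32. Note that nvidia-smi reports MiB (2**20 bytes), so both units are printed.

```python
BYTES_PER_FLOAT32 = 4        # assumption: all parameters are float32
total_params = 206538153     # "Total params" reported by model.summary()

param_bytes = total_params * BYTES_PER_FLOAT32

print(param_bytes)                      # 826152612
print(round(param_bytes / 10**6, 1))    # 826.2 (MB)
print(round(param_bytes / 2**20, 1))    # 787.9 (MiB, the unit used by nvidia-smi)
```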
I then compile the network, and can confirm that this does not increase GPU memory usage. And as we can see in this case, I have almost 1 GB of VRAM available at this point.
Then I try to feed a simple 16x16 image and a 1x1 ground truth to the network, and then everything blows up, because Keras starts allocating lots of memory again, for no reason that is obvious to me. Something about training the network seems to require a lot more memory than just having the model, which doesn't make sense to me. I have trained significantly deeper networks on this GPU in other frameworks, so that makes me think that I'm using Keras wrong (or there's something wrong in my setup, or in Keras, but of course that's hard to know for sure).
Here's the code:
from scipy import misc
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Convolution2D, MaxPooling2D, Reshape, Flatten, ZeroPadding2D, Dropout
import os
model = Sequential()
model.add(Convolution2D(256, 3, 3, border_mode='same', input_shape=(16,16,1)))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Convolution2D(512, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Convolution2D(256, 3, 3, border_mode='same'))
model.add(Convolution2D(32, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(4))
model.add(Dense(1))
model.summary()
os.system("nvidia-smi")
raw_input("Press Enter to continue...")
model.compile(optimizer='sgd',
              loss='mse',
              metrics=['accuracy'])
os.system("nvidia-smi")
raw_input("Compiled model. Press Enter to continue...")
n_batches = 1
batch_size = 1
for ibatch in range(n_batches):
    x = np.random.rand(batch_size, 16,16,1)
    y = np.random.rand(batch_size, 1)
    os.system("nvidia-smi")
    raw_input("About to train one iteration. Press Enter to continue...")
    model.train_on_batch(x, y)
    print("Trained one iteration")
Which gives the following output for me:
Using Theano backend.
Using gpu device 0: GeForce GTX 960 (CNMeM is disabled, cuDNN 5103)
/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
warnings.warn(warn)
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
convolution2d_1 (Convolution2D) (None, 16, 16, 256) 2560 convolution2d_input_1[0][0]
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D) (None, 8, 8, 256) 0 convolution2d_1[0][0]
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D) (None, 8, 8, 512) 1180160 maxpooling2d_1[0][0]
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D) (None, 4, 4, 512) 0 convolution2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D) (None, 4, 4, 1024) 4719616 maxpooling2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_4[0][0]
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_5[0][0]
____________________________________________________________________________________________________
convolution2d_7 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_6[0][0]
____________________________________________________________________________________________________
convolution2d_8 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_7[0][0]
____________________________________________________________________________________________________
convolution2d_9 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_8[0][0]
____________________________________________________________________________________________________
convolution2d_10 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_9[0][0]
____________________________________________________________________________________________________
convolution2d_11 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_10[0][0]
____________________________________________________________________________________________________
convolution2d_12 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_11[0][0]
____________________________________________________________________________________________________
convolution2d_13 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_12[0][0]
____________________________________________________________________________________________________
convolution2d_14 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_13[0][0]
____________________________________________________________________________________________________
convolution2d_15 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_14[0][0]
____________________________________________________________________________________________________
convolution2d_16 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_15[0][0]
____________________________________________________________________________________________________
convolution2d_17 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_16[0][0]
____________________________________________________________________________________________________
convolution2d_18 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_17[0][0]
____________________________________________________________________________________________________
convolution2d_19 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_18[0][0]
____________________________________________________________________________________________________
convolution2d_20 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_19[0][0]
____________________________________________________________________________________________________
convolution2d_21 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_20[0][0]
____________________________________________________________________________________________________
convolution2d_22 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_21[0][0]
____________________________________________________________________________________________________
convolution2d_23 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_22[0][0]
____________________________________________________________________________________________________
convolution2d_24 (Convolution2D) (None, 4, 4, 1024) 9438208 convolution2d_23[0][0]
____________________________________________________________________________________________________
maxpooling2d_3 (MaxPooling2D) (None, 2, 2, 1024) 0 convolution2d_24[0][0]
____________________________________________________________________________________________________
convolution2d_25 (Convolution2D) (None, 2, 2, 256) 2359552 maxpooling2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_26 (Convolution2D) (None, 2, 2, 32) 73760 convolution2d_25[0][0]
____________________________________________________________________________________________________
maxpooling2d_4 (MaxPooling2D) (None, 1, 1, 32) 0 convolution2d_26[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten) (None, 32) 0 maxpooling2d_4[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (None, 4) 132 flatten_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 5 dense_1[0][0]
====================================================================================================
Total params: 206538153
____________________________________________________________________________________________________
None
Thu Oct 6 09:05:42 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 0000:01:00.0 On | N/A |
| 30% 37C P2 28W / 120W | 1082MiB / 2044MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1796 G /usr/bin/X 155MiB |
| 0 2597 G compiz 65MiB |
| 0 5966 C python 849MiB |
+-----------------------------------------------------------------------------+
Press Enter to continue...
Thu Oct 6 09:05:44 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 0000:01:00.0 On | N/A |
| 30% 38C P2 28W / 120W | 1082MiB / 2044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1796 G /usr/bin/X 155MiB |
| 0 2597 G compiz 65MiB |
| 0 5966 C python 849MiB |
+-----------------------------------------------------------------------------+
Compiled model. Press Enter to continue...
Thu Oct 6 09:05:44 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 0000:01:00.0 On | N/A |
| 30% 38C P2 28W / 120W | 1082MiB / 2044MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1796 G /usr/bin/X 155MiB |
| 0 2597 G compiz 65MiB |
| 0 5966 C python 849MiB |
+-----------------------------------------------------------------------------+
About to train one iteration. Press Enter to continue...
Error allocating 37748736 bytes of device memory (out of memory). Driver report 34205696 bytes free and 2144010240 bytes total
Traceback (most recent call last):
File "memtest.py", line 65, in <module>
model.train_on_batch(x, y)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 712, in train_on_batch
class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1221, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py", line 717, in __call__
return self.function(*inputs)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 871, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 859, in __call__
outputs = self.fn()
MemoryError: Error allocating 37748736 bytes of device memory (out of memory).
Apply node that caused the error: GpuContiguous(GpuDimShuffle{3,2,0,1}.0)
Toposort index: 338
Inputs types: [CudaNdarrayType(float32, 4D)]
Inputs shapes: [(1024, 1024, 3, 3)]
Inputs strides: [(1, 1024, 3145728, 1048576)]
Inputs values: ['not shown']
Outputs clients: [[GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0}), GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
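The numbers in the error message can be checked by hand. The sketch below assumes 4 bytes per element, in line with the float32 type shown under "Inputs types". It shows that the failed allocation is exactly the size of one copy of a (1024, 1024, 3, 3) tensor, the input shape reported for the failing GpuContiguous node.

```python
filter_shape = (1024, 1024, 3, 3)  # "Inputs shapes" from the error message
bytes_per_element = 4              # float32

elements = 1
for dim in filter_shape:
    elements *= dim
tensor_bytes = elements * bytes_per_element

requested_bytes = 37748736         # "Error allocating ... bytes"
free_bytes = 34205696              # "Driver report ... bytes free"

print(tensor_bytes)                     # 37748736
print(tensor_bytes == requested_bytes)  # True
print(requested_bytes - free_bytes)     # 3543040 bytes short
```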
A few things to note:
I have tried both Theano and TensorFlow backends. Both have the same problems, and run out of memory at the same line. In TensorFlow, it seems that Keras preallocates a lot of memory (about 1.5 GB) so nvidia-smi doesn't help us track what's going on there, but I get the same out-of-memory exceptions. Again, this points towards an error in (my usage of) Keras (although it's hard to be certain about such things, it could be something with my setup).
I tried using CNMEM in Theano, which behaves like TensorFlow: It preallocates a large amount of memory (about 1.5 GB) yet crashes in the same place.
There are some warnings about the CudNN-version. I tried running the Theano backend with CUDA but not CudNN and I got the same errors, so that is not the source of the problem.
If you want to test this on your own GPU, you might want to make the network deeper/shallower depending on how much GPU memory you have to test this.
My configuration is as follows: Ubuntu 14.04, GeForce GTX 960, CUDA 7.5.18, CudNN 5.1.3, Python 2.7, Keras 1.1.0 (installed via pip)
I've tried changing the compilation of the model to use different optimizers and losses, but that doesn't seem to change anything.
I've tried changing the train_on_batch function to use fit instead, but it has the same problem.
I saw one similar question here on StackOverflow - Why does this Keras model require over 6GB of memory? - but as far as I can tell, I don't have those issues in my configuration. I've never had multiple versions of CUDA installed, and I've double checked my PATH, LD_LIBRARY_PATH and CUDA_ROOT variables more times than I can count.
Julius suggested that the activation parameters themselves take up GPU memory. If this is true, can somebody explain it a bit more clearly? I have tried changing the activation function of my convolution layers to functions that are clearly hard-coded with no learnable parameters as far as I can tell, and that doesn't change anything. Also, it seems unlikely that these parameters would take up almost as much memory as the rest of the network itself.
After thorough testing, the largest network I can train is about 453 MB of parameters, out of my ~2 GB of GPU RAM. Is this normal?
After testing Keras on some smaller CNNs that do fit in my GPU, I can see that there are very sudden spikes in GPU RAM usage. If I run a network with about 100 MB of parameters, 99% of the time during training it'll be using less than 200 MB of GPU RAM. But every once in a while, memory usage spikes to about 1.3 GB. It seems safe to assume that it's these spikes that are causing my problems. I've never seen these spikes in other frameworks, but they might be there for a good reason? If anybody knows what causes them, and if there's a way to avoid them, please chime in!
It is a very common mistake to forget that the activations, gradients and optimizer moment-tracking variables also take VRAM, not just the parameters, and that this increases memory usage quite a bit. The backprop calculations themselves make the training phase take almost double the VRAM of forward / inference use of the neural net, and the Adam optimizer triples the space usage, since it keeps two moment estimates per parameter.
So, in the beginning, when the network is created, only the parameters are allocated. However, when training starts, the model activations, backprop computations and the optimizer's tracking variables get allocated, increasing memory use by a large factor.
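As a rough illustration of that accounting, here is a minimal back-of-the-envelope sketch. The helper names (tensor_bytes, training_bytes) and the "slots per optimizer" figures are assumptions made for this example, not anything Keras reports, and activations and framework workspace are deliberately left out, so real usage will be higher.

```python
BYTES_PER_FLOAT32 = 4

def tensor_bytes(shape, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in bytes of one dense tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_value

def training_bytes(n_params, optimizer_slots, bytes_per_value=BYTES_PER_FLOAT32):
    """Rough lower bound: weights + gradients + per-parameter optimizer state.

    optimizer_slots is an assumed number of extra buffers per parameter:
    0 for plain SGD, 1 for SGD with momentum, 2 for Adam (two moment estimates).
    Activations and framework workspace are NOT included.
    """
    copies = 1 + 1 + optimizer_slots  # weights, gradients, optimizer state
    return n_params * copies * bytes_per_value

# The failed allocation in the traceback is exactly one float32 copy of the
# (1024, 1024, 3, 3) convolution kernel.
kernel_bytes = tensor_bytes((1024, 1024, 3, 3))
print(kernel_bytes)  # 37748736

# 453 MB of float32 parameters, as in the question (MB taken as 10**6 bytes).
n_params = 453 * 10**6 // BYTES_PER_FLOAT32
for name, slots in [("sgd", 0), ("momentum", 1), ("adam", 2)]:
    print(name, training_bytes(n_params, slots) / 1e6, "MB")
```

Under these assumptions, weights plus gradients alone already take about twice the parameter size, before any activations are stored, which is in line with a 2 GB card running out well below 2 GB of parameters.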
To allow the training of larger models, people:
use model parallelism to spread the weights and computations over different accelerators
use gradient checkpointing, which trades more computation for lower memory use during back-propagation
potentially use a memory-efficient optimizer that aims to reduce the number of tracking variables, such as Adafactor, for which you will find implementations for all popular deep learning frameworks
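To make the gradient checkpointing tradeoff concrete, here is a toy, framework-free sketch. It is not how any particular library implements it: the "layers" are scalar functions, and the names LAYERS, grad_full and grad_checkpointed are invented for this example. The idea shown is only the general one: keep one stored input per segment during the forward pass, then recompute the inputs inside each segment during the backward pass.

```python
import math

def dtanh(x):
    return 1.0 - math.tanh(x) ** 2

# Each toy "layer" is a scalar function paired with its derivative (16 layers).
LAYERS = [(math.sin, math.cos), (math.tanh, dtanh)] * 8

def grad_full(x):
    """Backprop that stores every layer input; storage grows with depth."""
    inputs = []
    for f, _ in LAYERS:
        inputs.append(x)
        x = f(x)
    g = 1.0
    for (_, df), a in zip(reversed(LAYERS), reversed(inputs)):
        g *= df(a)  # chain rule, last layer first
    return g, len(inputs)

def grad_checkpointed(x, segment=4):
    """Backprop that stores one input per segment and recomputes the rest."""
    checkpoints = {}
    for i, (f, _) in enumerate(LAYERS):
        if i % segment == 0:
            checkpoints[i] = x  # keep only the input of each segment
        x = f(x)
    g = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg = LAYERS[start:start + segment]
        # Recompute this segment's inputs from its checkpoint (extra compute).
        a = checkpoints[start]
        inputs = []
        for f, _ in seg:
            inputs.append(a)
            a = f(a)
        # Upper bound on values held at once: checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for (_, df), a_in in zip(reversed(seg), reversed(inputs)):
            g *= df(a_in)
    return g, peak

g_full, stored_full = grad_full(0.3)
g_ckpt, stored_ckpt = grad_checkpointed(0.3, segment=4)
print(stored_full, stored_ckpt)
```

Both functions return the same gradient; the checkpointed version holds fewer values at any one time at the cost of running each segment's forward pass a second time.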
Tools to train very large models:
Mesh-Tensorflow: https://arxiv.org/abs/1811.02084 https://github.com/tensorflow/mesh
Microsoft DeepSpeed: https://github.com/microsoft/DeepSpeed https://www.deepspeed.ai/
Facebook FairScale: https://github.com/facebookresearch/fairscale
Megatron-LM: https://arxiv.org/abs/1909.08053 https://github.com/NVIDIA/Megatron-LM
Article on integration in HuggingFace Transformers: https://huggingface.co/blog/zero-deepspeed-fairscale
Both Theano and Tensorflow augment the symbolic graph that is created, though each does so differently.
To analyze how the memory consumption is happening you can start with a smaller model and grow it to see the corresponding growth in memory. Similarly you can grow the batch_size to see the corresponding growth in memory.
Here is a code snippet for increasing batch_size based on your initial code:
from scipy import misc
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Convolution2D, MaxPooling2D, Reshape, Flatten, ZeroPadding2D, Dropout
import os
import matplotlib.pyplot as plt
def gpu_memory():
    # Parse the `nvidia-smi` output and return this process's GPU memory use in MiB.
    out = os.popen("nvidia-smi").read()
    ret = '0MiB'
    for item in out.split("\n"):
        if str(os.getpid()) in item and 'python' in item:
            ret = item.strip().split(' ')[-2]
    return float(ret[:-3])
gpu_mem = []
gpu_mem.append(gpu_memory())
model = Sequential()
model.add(Convolution2D(100, 3, 3, border_mode='same', input_shape=(16,16,1)))
model.add(Convolution2D(256, 3, 3, border_mode='same'))
model.add(Convolution2D(32, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(4))
model.add(Dense(1))
model.summary()
gpu_mem.append(gpu_memory())
model.compile(optimizer='sgd',
loss='mse',
metrics=['accuracy'])
gpu_mem.append(gpu_memory())
batches = []
n_batches = 20
batch_size = 1
for ibatch in range(n_batches):
    # Grow the batch by 10 samples each iteration and record GPU memory after training on it.
    batch_size = (ibatch+1)*10
    batches.append(batch_size)
    x = np.random.rand(batch_size, 16,16,1)
    y = np.random.rand(batch_size, 1)
    print(y.shape)
    model.train_on_batch(x, y)
    print("Trained one iteration")
    gpu_mem.append(gpu_memory())
fig = plt.figure()
plt.plot([-100, -50, 0]+batches, gpu_mem)
plt.show()
Also, for speed, Tensorflow grabs all the available GPU memory. To stop that, you need to add config.gpu_options.allow_growth = True in get_session():
# keras/backend/tensorflow_backend.py
def get_session():
    global _SESSION
    if tf.get_default_session() is not None:
        session = tf.get_default_session()
    else:
        if _SESSION is None:
            if not os.environ.get('OMP_NUM_THREADS'):
                config = tf.ConfigProto(allow_soft_placement=True,
                                        )
            else:
                nb_thread = int(os.environ.get('OMP_NUM_THREADS'))
                config = tf.ConfigProto(intra_op_parallelism_threads=nb_thread,
                                        allow_soft_placement=True)
            config.gpu_options.allow_growth = True
            _SESSION = tf.Session(config=config)
        session = _SESSION
    if not _MANUAL_VAR_INIT:
        _initialize_variables()
    return session
Now if you run the previous snippet you get a plot of GPU memory usage against batch size for each backend (plots not shown): one for Theano and one for Tensorflow.
Theano: Whatever memory is in use after model.compile() almost doubles at the start of training. This is because Theano augments the symbolic graph to do back-propagation, and each tensor needs a corresponding tensor for the backward flow of gradients. The memory needs don't seem to grow with batch_size, which is unexpected to me, as the placeholder size should increase to accommodate the data inflow from CPU->GPU.
Tensorflow: No GPU memory is allocated even after model.compile(), as Keras doesn't call get_session(), which is what actually calls _initialize_variables(), until after that point. Tensorflow seems to grab memory in chunks for speed, so the memory doesn't grow linearly with batch_size.
Having said all that, Tensorflow seems to be memory hungry, but for big graphs it's very fast. Theano, on the other hand, is very GPU-memory efficient but takes a very long time to initialize the graph at the start of training. After that it's also pretty fast.
200M params for a 2 GB GPU is too much. Also, your architecture is not efficient; using local bottlenecks will be more efficient.
Also, you should go from a small model to a big one, not backwards. Right now you have a 16x16 input; with this architecture that means that by the end most of your network will be working on zero padding rather than on input features.
Your model's layers depend on your input, so you can't just set an arbitrary number of layers and sizes; you need to count how much data will be passed to each of them, and understand why you are doing so.
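One way to do that counting by hand is sketched below for the small test model from the batch-size snippet earlier on this page (16x16x1 input, three 3x3 'same' convolutions with 100, 256 and 32 filters, 2x2 max pooling, then Dense(4) and Dense(1)). The helper names are made up for this example, and it assumes stride 1, bias terms on every layer, and that only layer outputs are counted as activations.

```python
def conv2d_same(h, w, c_in, c_out, k=3):
    """k x k 'same' convolution, stride 1, with bias: output shape and param count."""
    params = (k * k * c_in + 1) * c_out
    return (h, w, c_out), params

def dense(n_in, n_out):
    """Fully connected layer with bias: output shape and param count."""
    return (n_out,), (n_in + 1) * n_out

def count(shape):
    """Number of values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

def summarize(input_shape=(16, 16, 1)):
    rows = []
    h, w, c = input_shape
    for filters in (100, 256, 32):
        shape, params = conv2d_same(h, w, c, filters)
        rows.append(("conv %d" % filters, shape, params))
        h, w, c = shape
    h, w = h // 2, w // 2  # 2x2 max pooling halves height and width
    rows.append(("maxpool 2x2", (h, w, c), 0))
    n = h * w * c  # flatten
    for units in (4, 1):
        shape, params = dense(n, units)
        rows.append(("dense %d" % units, shape, params))
        n = units
    return rows

rows = summarize()
for name, shape, params in rows:
    print("%-12s out=%-14s activations=%-6d params=%d"
          % (name, shape, count(shape), params))
total_params = sum(r[2] for r in rows)
total_acts = sum(count(r[1]) for r in rows)
print("total params:", total_params)
print("activations per sample:", total_acts)
```

The parameter total is fixed, while the activation total is per sample and is multiplied by the batch size during training, so the two should be budgeted separately.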
I would recommend watching this free course: http://cs231n.github.io

How to think about weights in Myrrix

I have the following input for Myrrix:
11, 101, 1
11, 102, 1
11, 103, 1
11, 104, 1000
11, 105, 1000
11, 106, 1000
12, 101, 1
12, 102, 1
12, 103, 1
12, 222, 1
13, 104, 1000
13, 105, 1000
13, 106, 1000
13, 333, 1000
I am looking for items to recommend to user 11. The expectation is that item 333 will be recommended first (because of the higher weights for user 13 and items 104, 105, 106).
Here are the recommendation results from Myrrix:
11, 222, 0.04709
11, 333, 0.0334058
Notice that item 222 is recommended with strength 0.047, but item 333 is only given a strength of 0.033 --- the opposite of the expected results.
I also would have expected the difference in strength to be larger (since 1000 and 1 are so different), but obviously that's moot when the order isn't even what I expected.
How can I interpret these results and how should I think about the weight parameter? We are working with a large client under a tight deadline and would appreciate any pointers.
It's hard to judge based on a small, synthetic data set. I think the biggest factor here will be the parameters: what is the number of features? What is lambda? I would expect features = 2 here. If it's higher, I think you quickly over-fit, and the results are mostly the noise left over after the model perfectly explains that user 11 doesn't interact with 222 and 333.
The values are quite low, suggesting both of these are not likely results, and so their order may be more noise than anything. Do you see different results if the model is rebuilt from another random starting point?

Resources