This is my train.prototxt. And this is my deploy.prototxt.
When I want to load my deploy file I get this error:
File "./python/caffe/classifier.py", line 29, in __init__
in_ = self.inputs[0]
IndexError: list index out of range
So, I removed the data layer:
F1117 23:16:09.485153 21910 insert_splits.cpp:35] Unknown bottom blob 'data' (layer 'conv1', bottom index 0)
*** Check failure stack trace: ***
Then I removed bottom: "data" from the conv1 layer.
After it, I got this error:
F1117 23:17:15.363919 21935 insert_splits.cpp:35] Unknown bottom blob 'label' (layer 'loss', bottom index 1)
*** Check failure stack trace: ***
I removed bottom: "label" from the loss layer, and got this error:
I1117 23:19:11.171021 21962 layer_factory.hpp:76] Creating layer conv1
I1117 23:19:11.171036 21962 net.cpp:110] Creating Layer conv1
I1117 23:19:11.171041 21962 net.cpp:433] conv1 -> conv1
F1117 23:19:11.171061 21962 layer.hpp:379] Check failed: MinBottomBlobs() <= bottom.size() (1 vs. 0) Convolution Layer takes at least 1 bottom blob(s) as input.
*** Check failure stack trace: ***
What should I do to fix it and create my deploy file?
There are two main differences between a "train" prototxt and a "deploy" one:
1. Inputs: While for training the data is fixed to a pre-processed training dataset (lmdb/HDF5 etc.), deploying the net requires it to process other inputs in a more "random" fashion.
Therefore, the first change is to remove the input layers (layers that push "data" and "labels" during TRAIN and TEST phases). To replace the input layers you need to add the following declaration:
input: "data"
input_shape: { dim:1 dim:3 dim:224 dim:224 }
This declaration does not provide the actual data for the net, but it tells the net what shape to expect, allowing caffe to pre-allocate necessary resources.
2. Loss: the top most layers in a training prototxt define the loss function for the training. This usually involves the ground truth labels. When deploying the net, you no longer have access to these labels. Thus loss layers should be converted to "prediction" outputs. For example, a "SoftmaxWithLoss" layer should be converted to a simple "Softmax" layer that outputs class probabilities instead of a log-likelihood loss. Some other loss layers already have predictions as inputs, thus it is sufficient just to remove them.
Update: see this tutorial for more information.
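For illustration, here is a minimal pycaffe sketch (the file names and the "prob" output blob name are placeholders for your own) of how the deploy net is used once both changes above are made:

import numpy as np
import caffe

# Load the deploy definition together with the trained weights.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# The input declaration only tells caffe what shape to expect; you still have
# to fill the "data" blob yourself with a properly pre-processed image.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a real image
net.blobs['data'].reshape(*image.shape)
net.blobs['data'].data[...] = image

# Forward pass: the former loss output is now a probability output.
out = net.forward()
print(out['prob'].argmax())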
Besides the advice from @Shai, you may also want to disable the dropout layers.
Although Yangqing Jia, the author of Caffe, once said that dropout layers have negligible impact on the testing results (google group conversation, 2014), other deep learning tools suggest disabling dropout in the deploy phase (for example, Lasagne).
I am trying to create a hybrid model which consists of EfficientNetB7 and LSTM.
# pretrained model act as a feature extractor
Effnet = tensorflow.keras.applications.EfficientNetB7(
    input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False, weights="imagenet", pooling="avg")
Effnet.trainable = False

x = Flatten()(Effnet.output)
x = BatchNormalization()(x)

# add two LSTM Layers
x = LSTM(8, input_shape=(IMG_SIZE, IMG_SIZE, 3), return_sequences=False)(x)
x = LSTM(8)(x)
x = BatchNormalization()(x)

# add two fully connected dense layers 1024 as my model
x = Dense(1024)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(1024)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)

x = Dense(NUM_CLASSE)(x)
x = BatchNormalization()(x)
prediction = Activation('softmax')(x)

model = Model(inputs=Effnet.input, outputs=prediction)
model.summary()
But it gives me the following error:
ValueError: Input 0 of layer "lstm_6" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 2560)
I think the average pooling in EfficientNetB7 is causing the problem; how do I remove it?
How can I fix it, please? Thank you, regards!
The problem is on this line:
x=LSTM(8,input_shape=(IMG_SIZE,IMG_SIZE,3),return_sequences=False)(x)
You defined the LSTM layer to expect an input of dimension 3. However, that only holds for the very beginning of your network, i.e. the image that flows into EfficientNetB7. Once you take the last output from EfficientNet and flatten it, you get a 1D feature vector (2D including the batch dimension).
The error message is actually pretty straightforward.
expected ndim=3, found ndim=2. Full shape received: (None, 2560)
2560 comes from flattening the features, and the first dimension is the one for batch size.
You must correct the input to your LSTM layer. If you do not specify anything, Keras might just figure it out itself.
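As a hedged sketch of one possible fix (the layer sizes below are illustrative, not necessarily what you want): reshape the flattened features into a (timesteps, features) sequence so the LSTM receives a 3D input, and keep return_sequences=True on the first LSTM so the second one also sees a sequence.

from tensorflow.keras.layers import Input, Reshape, LSTM, Dense
from tensorflow.keras.models import Model

features = Input(shape=(2560,))        # e.g. the pooled/flattened EfficientNetB7 output
x = Reshape((10, 256))(features)       # 10 "timesteps" of 256 features each (10 * 256 == 2560)
x = LSTM(8, return_sequences=True)(x)  # first LSTM keeps the sequence...
x = LSTM(8)(x)                         # ...so the second LSTM still receives ndim=3
out = Dense(2, activation='softmax')(x)

model = Model(features, out)
model.summary()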
I used a pre-trained GoogLeNet and then fine-tuned it on my dataset for a binary classification problem. The validation dataset seems to give a "loss3/top1" of 98.5%. But when I evaluate the performance on my evaluation dataset it gives me 50% accuracy. Whatever changes I made in train_val.prototxt, I made the same changes in deploy.prototxt, and I am not sure what changes I should make in these lines.
name: "GoogleNet"
layer {
name: "data"
type: "input"
top: "data"
input_param { shape: { dim:10 dim:3 dim:224 dim:224 } }
}
Any suggestions???
You do not need to change anything further in your deploy.prototxt*, but in the way you feed the data to the net. You must transform your evaluation images in the same way you transformed your training/validation images.
See, for example, how classifier.py puts the input images through a properly initialized caffe.io.Transformer class.
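For example, a minimal sketch along those lines (the mean values and file names below are placeholders; use whatever preprocessing your training data actually went through):

import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))                    # HxWxC -> CxHxW
transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))   # the mean you subtracted during training
transformer.set_raw_scale('data', 255)                          # caffe.io.load_image returns values in [0, 1]
transformer.set_channel_swap('data', (2, 1, 0))                 # RGB -> BGR, if you trained on BGR images

image = caffe.io.load_image('example.jpg')                      # placeholder path
net.blobs['data'].data[...] = transformer.preprocess('data', image)
out = net.forward()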
The "Input" layer you have in the prototxt is merely a declaration for caffe to allocate memory according to an input blob of shape 10-by-3-by-224-by-224.
* of course, you must verify that train_val.prototxt and deploy.prototxt are exactly the same (apart from the input layer(s) and loss layer(s)): that includes making sure layer names are identical as caffe uses layer names to assign weights from 'caffemodel' file to the actual parameters it loads. Mismatching names will cause caffe to use random weights for some of the layers.
I ran caffe and got this output:
Who can tell me what the problem is?
I will really appreciate it!
It seems like one (or more) of your label values are invalid, see this PR for information:
If you have an invalid ground truth label, "SoftmaxWithLoss" will silently access invalid memory [...] The old check only worked in DEBUG mode and also only worked for CPU.
Make sure your prediction vector length matches the number of labels you try to predict.
From your comments, it seems like you have labels in the range 0..10575, but on the other hand, your classification layer, "fc7", only predicts probabilities for 1000 classes. Thus, the "SoftmaxWithLoss" layer tries to compute the loss for a label l >= 1000 and accesses memory outside the probability array, resulting in a segmentation fault.
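As a quick hedged sanity check (the list-file name and num_output value are placeholders), you can verify the label range against the size of your classification layer before training:

# the largest label must be strictly smaller than num_output of the classification layer
labels = [int(line.split()[-1]) for line in open('train.txt')]
num_output = 1000

print('label range: %d..%d' % (min(labels), max(labels)))
assert max(labels) < num_output, \
    'label %d is out of range for a %d-way softmax' % (max(labels), num_output)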
This is really weird. I'm implementing this model:
Except that I read data from a text file using an "ImageData" layer with batch_size: 1. There are only two labels and the text file is organized as usual:
/home/.../pathToFile 0
...
/home/.../pathToFile 1
Still, Caffe only trains and tests label 0!
I run caffe using the regular tool.
./build/tools/caffe train --solver=solver.prototxt
When I open the net in pycaffe I get this message for the first time ever:
WARNING: Logging before InitGoogleLogging() is written to STDERR
and the size of the
net.blobs['label'].data
is now 1, when it should be 2!
Not only that, but the label seems to be a float rather than an integer.
In: net.blobs['label'].data
Out: array([ 0.], dtype=float32)
I know that this has worked before, I just can't get my head around what I'm doing wrong or where to begin troubleshoot.
The output shape of your network depends on the input batch_size: if you define batch_size: 1 then your net processes a single example each time, thus it only reads a single label. If you change batch_size to 2, caffe will read two samples and consequently the shape of the label blob will become 2.
One exception to this "shape rule" is the loss output: the loss defines a scalar function with respect to which gradients are computed. Thus, the loss output will always be a scalar regardless of the input shape.
Regarding the data type of label: Caffe stores all variables in "Blobs" of type float32.
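A hedged pycaffe sketch of these two points (the prototxt name and blob names are placeholders for your own net):

import caffe

net = caffe.Net('train_val.prototxt', caffe.TRAIN)
net.forward()                          # reads one batch through the "ImageData" layer

print(net.blobs['label'].data.shape)   # (1,) with batch_size: 1, (2,) with batch_size: 2
print(net.blobs['label'].data.dtype)   # float32 -- labels are stored as floats
print(net.blobs['loss'].data)          # a single scalar value, regardless of batch_size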
I've noticed that a frequent occurrence during training is NaNs being introduced.
Often it seems to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.
Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?
The overarching question here is simply: What is the most common reason for NaNs to occur during training? And secondly, what are some methods for combatting this (and why do they work)?
I came across this phenomenon several times. Here are my observations:
Gradient blow up
Reason: large gradients throw the learning process off-track.
What you should expect: Looking at the runtime log, you should look at the loss values per-iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss will be too large to be represented by a floating-point variable and it will become nan.
What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
Bad learning rate policy and params
Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates, thus invalidating all parameters.
What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:
... sgd_solver.cpp:106] Iteration 0, lr = -nan
What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define the max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
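To see why, here is a hedged numpy sketch of the "poly" schedule (lr = base_lr * (1 - iter/max_iter)^power), assuming a missing max_iter falls back to protobuf's default of 0:

import numpy as np

def poly_lr(base_lr, power, cur_iter, max_iter):
    # lr = base_lr * (1 - iter/max_iter)^power
    return base_lr * (1.0 - np.float32(cur_iter) / np.float32(max_iter)) ** power

print(poly_lr(0.01, 0.5, 100, 10000))  # a valid, slowly decaying learning rate
print(poly_lr(0.01, 0.5, 100, 0))      # nan: 100/0 -> inf, 1 - inf -> -inf, (-inf)**0.5 -> nan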
Faulty Loss function
Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding an "InfogainLoss" layer with non-normalized values, using a custom loss layer with bugs, etc.
What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: See if you can reproduce the error, add printouts to the loss layer and debug the error.
For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
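A toy numpy sketch of that failure mode (the numbers are made up): normalizing by per-label frequency divides by zero for any label absent from the batch, and the resulting inf turns into nan downstream.

import numpy as np

num_labels = 5
batch_labels = np.array([0, 0, 1, 3])                              # labels 2 and 4 never appear in this batch
counts = np.bincount(batch_labels, minlength=num_labels).astype(np.float32)
weights = 1.0 / counts                                             # inf where count == 0
print(weights)                                                     # [0.5 1.  inf 1.  inf]
print(weights * counts)                                            # [1. 1. nan 1. nan] -- inf * 0 = nan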
Faulty input
Reason: you have an input with nan in it!
What you should expect: once the learning process "hits" this faulty input - output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging you can build a simple net that reads the input layer, has a dummy loss on top of it and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
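As a hedged complement to the dummy-net trick, if your inputs happen to live in HDF5 you can also scan them directly for non-finite values (the file name is a placeholder):

import h5py
import numpy as np

with h5py.File('train.h5', 'r') as f:
    for name in f.keys():
        data = np.asarray(f[name])
        bad = np.count_nonzero(~np.isfinite(data))
        if bad:
            print('%s: %d non-finite values' % (name, bad))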
stride larger than kernel size in "Pooling" layer
For some reason, choosing stride > kernel_size for pooling may result in nans. For example:
layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}
results in nans in y.
Instabilities in "BatchNorm"
It was reported that under some settings the "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.
Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information (including gradient magnitudes and activation values) to the log during training. This information can help in spotting gradient blow-ups and other problems in the training process.
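A small hedged helper (the log path is a placeholder) to locate where things first blow up: with debug_info enabled, scan the training log for the first line mentioning nan or inf.

# print the first log line containing 'nan' or 'inf'
with open('caffe_train.log') as f:
    for i, line in enumerate(f, 1):
        low = line.lower()
        if 'nan' in low or ' inf' in low:
            print('line %d: %s' % (i, line.rstrip()))
            break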
In my case, not setting the bias in the convolution/deconvolution layers was the cause.
Solution: add the following to the convolution layer parameters.
bias_filler {
  type: "constant"
  value: 0
}
This answer is not about a cause for nans, but rather proposes a way to help debug it.
You can have this Python layer:
import caffe
import numpy as np


class checkFiniteLayer(caffe.Layer):
    def setup(self, bottom, top):
        self.prefix = self.param_str

    def reshape(self, bottom, top):
        pass

    def forward(self, bottom, top):
        # count non-finite entries (nan/inf) in every bottom blob
        for i in range(len(bottom)):
            isbad = np.sum(1 - np.isfinite(bottom[i].data[...]))
            if isbad > 0:
                raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
                                (self.prefix, i, 100 * float(isbad) / bottom[i].count))

    def backward(self, top, propagate_down, bottom):
        # apply the same check to the gradients flowing back through the top blobs
        for i in range(len(top)):
            if not propagate_down[i]:
                continue
            isf = np.sum(1 - np.isfinite(top[i].diff[...]))
            if isf > 0:
                raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
                                (self.prefix, i, 100 * float(isf) / top[i].count))
Add this layer to your train_val.prototxt at points you suspect may cause trouble:
layer {
  type: "Python"
  name: "check_loss"
  bottom: "fc2"
  top: "fc2"  # "in-place" layer
  python_param {
    module: "check_finite_layer"  # the file check_finite_layer.py must be in $PYTHONPATH
    layer: "checkFiniteLayer"
    param_str: "prefix-check_loss"  # string for printouts
  }
}
The learning rate may be too high and should be decreased. In my RNN code the accuracy was nan; selecting a lower learning rate fixed it.
One more solution for anyone stuck like I just was:
I was receiving nan or inf losses on a network I set up with the float16 dtype across the layers and input data. After all else failed, it occurred to me to switch back to float32, and the nan losses were solved!
So bottom line, if you switched dtype to float16, change it back to float32.
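A minimal sketch of that switch for tf.keras (assuming the dtype was set globally rather than per layer; adjust for your framework):

import tensorflow as tf

# set the global float type back to float32 before building the model
tf.keras.backend.set_floatx('float32')
print(tf.keras.backend.floatx())   # 'float32'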
I was trying to build a sparse autoencoder and had several layers in it to induce sparsity. While running my net, I encountered NaNs. On removing some of the layers (in my case, I actually had to remove one), I found that the NaNs disappeared. So, I guess too much sparsity may lead to NaNs as well (some 0/0 computations may have been invoked!?)