Training with the Caffe library and loss does not converge

I use Caffe for recognition, and my problem is that the loss never converges.
My training parameters in the configuration are:
Conf.base_lr = 0.2;
Conf.max_iter = 800001;
Conf.test_iter = 100;
Conf.test_interval = 1000;
Conf.momentum = 0.5;
Conf.random_seed = 2;
Conf.clip_gradients = 0.1;
Conf.gamma = 0.8;
Conf.stepsize = 100000;
Conf.weights = "";
Conf.display_interval = 100;
Conf.snapshot_prefix_folder = "../tmp";
Conf.snapshot_interval = 10000;
Conf.schematic_path = "../tmp/reinspect.png";
Conf.graph_prefix = "../tmp/history";
Conf.log_file = "../tmp/log_brainwash.txt";
Conf.graph_interval = 500;
Conf.init_range = 0.1;
Then, when I check the backward-pass debug output, I see:
All net params (data, diff): L1 norm = (208684, 3.43485e+11); L2 norm = (135.231, 3.96399e+08)
The diff values of the L1 and L2 norms are huge and not normal. What could be wrong with the parameters in my configuration, and how should I tune them?
Some of my forward and backward log data can be seen in this link.
Previously, some layers were not included in backpropagation. I now force them, and all layers are included except those with no bottom blobs, such as Input and DummyData. They are shown below.
This development is a similar implementation to this Lib (the only difference is Python versus C++). Their implementation includes all of those layers in backpropagation; DummyData is NumpyData in their Python implementation. If it is necessary, how do I include those layers in backpropagation?
layer {
  name: "image"
  type: "Input"
  top: "image"
  input_param { shape: { dim: 1 dim: 3 dim: 480 dim: 640 } }
}
layer {
  name: "box_flags"
  type: "Input"
  top: "box_flags"
  input_param { shape: { dim: 300 dim: 1 dim: 5 dim: 1 } }
}
layer {
  name: "boxes"
  type: "Input"
  top: "boxes"
  input_param { shape: { dim: 300 dim: 4 dim: 5 dim: 1 } }
}
layer {
  name: "lstm_hidden_seed"
  type: "DummyData"
  top: "lstm_hidden_seed"
  dummy_data_param {
    shape { dim: 300 dim: 250 }
    data_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "lstm_mem_seed"
  type: "DummyData"
  top: "lstm_mem_seed"
  dummy_data_param {
    shape { dim: 300 dim: 250 }
    data_filler { type: "constant" value: 0 }
  }
}
The DummyData layer was NumpyData in the Python version; when I converted to C++, I changed it to DummyData initialized with 0.
Do I need to include all of those Input and DummyData layers in backpropagation?
I still get these abnormally large values for the L1 and L2 norms.
[Backward] All net params (data, diff): L1 norm = (208696, 4.09333e+06); L2 norm = (135.23, 4791.7)

Your learning rate is very high, which makes the optimization diverge. Try reducing it by a factor of at least 50 and restart the training.
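To put numbers on this, here is a rough sketch in plain Python (not Caffe code) of how the learning rate and gradient clipping from the question might combine. It assumes the solver uses a "step" learning-rate policy, which the question does not actually state (it only lists gamma and stepsize), and the helper names step_lr and clip_scale are made up for illustration.

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    """Learning rate under a "step" policy: base_lr * gamma^floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


def clip_scale(l2_norm_diff, clip_gradients):
    """Factor applied to every gradient when the global L2 norm exceeds the threshold."""
    if clip_gradients > 0 and l2_norm_diff > clip_gradients:
        return clip_gradients / l2_norm_diff
    return 1.0


# Values copied from the question's configuration and log line.
base_lr, gamma, stepsize, clip = 0.2, 0.8, 100000, 0.1
l2_diff = 3.96399e+08

print("clip scale factor:", clip_scale(l2_diff, clip))
print("lr at iteration 0:", step_lr(base_lr, gamma, stepsize, 0))
print("lr at iteration 250000:", step_lr(base_lr, gamma, stepsize, 250000))
print("suggested starting lr (base_lr / 50):", base_lr / 50)
```

Under these assumptions the rate stays at 0.2 for the first 100000 iterations and only drops by 20% per step after that, so lowering base_lr itself is the quicker thing to try.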


Label smoothing in caffe with prototxt without data regeneration

I've got a huge data set in LMDB (40Gb) that I use for training a binary classifier with caffe.
Data layer in Caffe contains integer labels.
Are there any ready-made layers that could transform them into floats and add some random jitter, so I could apply the label smoothing technique as described in 7.5.1 here?
I have seen examples with HDF5, but they require regenerating the data set, and I would like to avoid that.
You can use a "DummyData" layer to generate the random noise you wish to add to the labels. Once you have the noise, use an "Eltwise" layer to sum them up:
layer {
  name: "noise"
  type: "DummyData"
  top: "noise"
  dummy_data_param {
    shape { dim: 10 dim: 1 dim: 1 dim: 1 } # assuming batch size = 10
    data_filler { type: "uniform" min: -0.1 max: 0.1 } # noise ~U(-0.1, 0.1)
  }
}
layer {
  name: "label_noise"
  type: "Eltwise"
  bottom: "label" # the input integer labels
  bottom: "noise"
  top: "label_noise"
  eltwise_param { operation: SUM }
}
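For intuition, the arithmetic these two layers perform can be sketched in plain Python. This is only an illustration of "label + U(-0.1, 0.1)" for one batch of 10, not Caffe code; the function name jitter_labels and the example labels are made up.

```python
import random


def jitter_labels(labels, low=-0.1, high=0.1, seed=None):
    """Return float labels with independent uniform noise added to each one."""
    rng = random.Random(seed)
    return [float(y) + rng.uniform(low, high) for y in labels]


# A hypothetical batch of 10 binary labels, matching the assumed batch size above.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
noisy = jitter_labels(labels, seed=0)
print(noisy)
```

Note that the result is real-valued and can fall slightly outside [0, 1], so the loss layer has to accept float targets (for example "SigmoidCrossEntropyLoss"); "SoftmaxWithLoss" expects integer class indices.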

I can't get Caffe working

After some struggling, I decided to try the simplest possible task: training a network to classify whether a number is non-negative. And I failed...
I generated the data with the following code, and I'm not sure it is right. I read the data back from the file and it looked right, though...
#pragma comment(lib, "hdf5")
#pragma comment(lib, "hdf5_cpp")
#include <cstdint>
#include <array>
#include <random>
#include <string>
#include <vector>
using namespace std;
#include <H5Cpp.h>
using namespace H5;

mt19937 rng;

// Random float between i_min and i_max, scaled from the 32-bit output of rng().
float randf(float i_min, float i_max)
{
    return rng() * ((i_max - i_min) / 0x100000000) + i_min;
}

#define NAME "pos_neg"
#define TRAIN_SET_SIZE 0x100000
#define TEST_SET_SIZE 0x10000

// Writes i_count samples to pos_neg.<i_cat>.h5 as datasets "data" and "label".
void make(const string &i_cat, uint32_t i_count)
{
    H5File file(NAME "." + i_cat + ".h5", H5F_ACC_TRUNC);
    hsize_t dataDim[2] = { i_count, 1 };
    hsize_t labelDim = i_count;
    FloatType dataType(PredType::NATIVE_FLOAT);
    DataSpace dataSpace(2, dataDim);
    DataSet dataSet = file.createDataSet("data", dataType, dataSpace);
    IntType labelType(PredType::NATIVE_INT);
    DataSpace labelSpace(1, &labelDim);
    DataSet labelSet = file.createDataSet("label", labelType, labelSpace);
    vector<float> data(i_count);
    vector<int> labels(i_count);
    // Alternate samples: label 0 for values in [0, 1], label 1 for values in [-1, 0].
    for (uint32_t i = 0; i < i_count / 2; ++i)
    {
        labels[i * 2] = 0;
        data[i * 2] = randf(0.f, 1.f);
        labels[i * 2 + 1] = 1;
        data[i * 2 + 1] = randf(-1.f, 0.f);
    }
    dataSet.write(&data[0], PredType::NATIVE_FLOAT);
    labelSet.write(&labels[0], PredType::NATIVE_INT);
}

int main()
{
    make("train", TRAIN_SET_SIZE);
    make("test", TEST_SET_SIZE);
}
And the network looks like this
name: "PosNegNet"
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  hdf5_data_param {
    source: "pos_neg_train.txt"
    batch_size: 64
  }
}
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  hdf5_data_param {
    source: "pos_neg_test.txt"
    batch_size: 65536
  }
}
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "data"
  top: "fc1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc1"
  bottom: "label"
  top: "loss"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
And here is one set of parameters I tried:
net: "pos_neg.prototxt"
test_iter: 1
test_interval: 500
base_lr: 0.001
momentum: 0.9
momentum2: 0.999
lr_policy: "fixed"
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "pos_neg"
type: "Adam"
solver_mode: GPU
I ran caffe.exe on Windows, and I always got loss = 0 and accuracy = 0.5.
I know I must have done something wrong, but I don't know where to look, other than digging through the source code...
I also found that Caffe is fairly slow. I got only around 16 iterations per second for float[64] data with 1024 items per batch on a 1080Ti. Is that normal, or did I do something wrong again?
Set num_output: 2 in your "fc1": when using "SoftmaxWithLoss" and/or "Accuracy" layers, Caffe expects the prediction to be a vector with one score per class. You have two classes, so this vector should have length 2 (and not 1, as it currently does). With a single output, the softmax over that one value is always 1, which is why the loss is stuck at 0.
Alternatively, you can keep num_output: 1 and switch the loss to a "SigmoidCrossEntropyLoss" layer. However, you will not be able to use the "Accuracy" layer anymore...
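A small sketch in plain Python (not Caffe code) shows what happens numerically when a softmax loss is fed a single score. The helpers softmax and softmax_loss are written out here only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, label):
    """Negative log-probability of the true class."""
    return -math.log(softmax(scores)[label])


# One output: the "probability" is 1 whatever the score is, so the loss is 0.
for s in (-3.0, 0.0, 5.0):
    print(s, softmax([s]), softmax_loss([s], 0))

# Two outputs: the loss now depends on the scores and can drive learning.
print(softmax([2.0, -1.0]), softmax_loss([2.0, -1.0], 0))
```

With one output there is nothing for the softmax to compare against, so the gradient carries no information, which would fit the constant loss = 0 and chance-level accuracy reported in the question.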

FCN:Check failed: outer_num_ * inner_num_ == bottom[1]->count()

I designed a net similar to FCN. The input data is 1*224*224 and the input label is 1*224*224, but I get this error:
F0502 07:57:30.032742 18127 softmax_loss_layer.cpp:47] Check failed: outer_num_ * inner_num_ == bottom[1]->count() (50176 vs. 1) Number of labels must match number of predictions; e.g., if softmax axis == 1 and prediction shape is (N, C, H, W), label count (number of labels) must be N*H*W, with integer values in {0, 1, ..., C-1}.
Here is the input structure:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  image_data_param {
    source: "/home/zhaimo/fcn-master/mo/train.txt"
    batch_size: 1
    shuffle: true
  }
}
The softmax layer:
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "upscore1"
  bottom: "label"
  top: "loss"
  loss_param {
    ignore_label: 255
    normalize: false
  }
}
the train.txt file:
/home/zhaimo/fcn-master/data/vessel/train/original/01.png /home/zhaimo/SegNet/data/vessel/train/label/01.png
/home/zhaimo/fcn-master/data/vessel/train/original/02.png /home/zhaimo/SegNet/data/vessel/train/label/02.png
/home/zhaimo/fcn-master/data/vessel/train/original/03.png /home/zhaimo/SegNet/data/vessel/train/label/03.png
/home/zhaimo/fcn-master/data/vessel/train/original/04.png /home/zhaimo/SegNet/data/vessel/train/label/04.png
The first file name is the input data and the second one is its label.
I tried to use two ImageData layers as input:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  image_data_param {
    source: "/home/zhaimo/fcn-master/mo/train_o.txt"
    batch_size: 1
    shuffle: false
  }
}
layer {
  name: "label"
  type: "ImageData"
  top: "label"
  image_data_param {
    source: "/home/zhaimo/fcn-master/mo/train_l.txt"
    batch_size: 1
    shuffle: false
  }
}
but I get another error:
I0502 08:34:46.429774 19100 layer_factory.hpp:77] Creating layer data
I0502 08:34:46.429808 19100 net.cpp:100] Creating Layer data
I0502 08:34:46.429816 19100 net.cpp:408] data -> data
F0502 08:34:46.429834 19100 layer.hpp:389] Check failed: ExactNumTopBlobs() == top.size() (2 vs. 1) ImageData Layer produces 2 top blob(s) as output.
*** Check failure stack trace: ***
Aborted (core dumped)
If I use two ImageData layers, how should I modify deploy.prototxt?
Here is the file I wrote:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "tmp0"
  input_param { shape: { dim: 1 dim: 1 dim: 224 dim: 224 } }
}
And the inference script:
import numpy as np
from PIL import Image
caffe_root = '/home/zhaimo/'
import sys
sys.path.insert(0, caffe_root + 'caffe-master/python')
import caffe
# load image, switch to BGR, subtract mean, and make dims C x H x W for Caffe
im = Image.open('/home/zhaimo/fcn-master/data/vessel/test/13.png')
in_ = np.array(im, dtype=np.float32)
#in_ = in_[:,:,::-1]
#in_ -= np.array((104.00698793,116.66876762,122.67891434))
#in_ = in_.transpose((2,0,1))
# load net
net = caffe.Net('/home/zhaimo/fcn-master/mo/deploy.prototxt', '/home/zhaimo/fcn-master/mo/snapshot/train/_iter_200000.caffemodel', caffe.TEST)
# shape for input (data blob is N x C x H x W), set data
net.blobs['data'].reshape(1, *in_.shape)
net.blobs['data'].data[...] = in_
# run net and take argmax for prediction
net.forward()
out = net.blobs['score'].data[0].argmax(axis=0)
but I get this error:
F0504 08:16:46.423981 3383 layer.hpp:389] Check failed: ExactNumTopBlobs() == top.size() (2 vs. 1) ImageData Layer produces 2 top blob(s) as output.
How should I modify the file, please?
Your problem is the number of top blobs: an ImageData layer always produces two. For two ImageData layers, use this:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "tmp"
  image_data_param {
    source: "/home/zhaimo/fcn-master/mo/train_o.txt"
    batch_size: 1
    shuffle: false
  }
}
layer {
  name: "label"
  type: "ImageData"
  top: "label"
  top: "tmp1"
  image_data_param {
    # you probably also need
    # is_color: false
    source: "/home/zhaimo/fcn-master/mo/train_l.txt"
    batch_size: 1
    shuffle: false
  }
}
In the text file, just set all labels to 0. You are not going to use tmp/tmp1, so it doesn't matter.
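To see where the "(50176 vs. 1)" in the first error comes from, the shape check can be reproduced in a few lines of plain Python. This is only a sketch of the counting rule quoted in the error message, not Caffe's code; the helper names are made up, and the number of classes (2) is an assumption since the question does not state it.

```python
from functools import reduce
from operator import mul


def count(shape):
    """Total number of elements in a blob of the given shape."""
    return reduce(mul, shape, 1)


def expected_label_count(pred_shape, axis=1):
    """Labels needed by a softmax loss: product of all dims except the softmax axis."""
    outer = count(pred_shape[:axis])
    inner = count(pred_shape[axis + 1:])
    return outer * inner


pred_shape = (1, 2, 224, 224)           # N x C x H x W, with C = 2 assumed
scalar_label_shape = (1,)               # one integer label per image, as in train.txt
image_label_shape = (1, 1, 224, 224)    # a label image loaded by a second ImageData layer

print(expected_label_count(pred_shape), "vs.", count(scalar_label_shape))
print(expected_label_count(pred_shape), "vs.", count(image_label_shape))
```

A single ImageData layer reads the second column of train.txt as one integer label per image, which gives a count of 1; a per-pixel label image gives the 50176 values the loss layer asks for.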

tensorflow conv2d unexpected convolution result

I am trying to migrate a Caffe network and model (weights) to TensorFlow.
The original first layer is defined as shown at the end of this post: a stride-one convolution on a 1x128x128 gray image with kernel size 5x5 and 96 output channels.
I converted the weights from the caffemodel file to numpy arrays with this procedure:
net = caffe.Net(model, caffe.TEST)
weights = net.params[name][0].data
bias = net.params[name][1].data
if "fc" in name:
    weights = weights.transpose()  # 2D
elif "conv" in name:
    weights = weights.transpose(2, 3, 1, 0)
The Caffe weights have shape (96, 1, 5, 5) and the biases have shape (96,). After the transpose, the new arrays, with weights shape (5, 5, 1, 96) and biases shape (96,), are used to initialize the TensorFlow filter.
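As a sanity check on the transpose itself, here is a small numpy sketch using a dummy array in place of the real weights. The function name caffe_to_tf_conv_weights is made up for illustration; it only wraps the same transpose(2, 3, 1, 0) call.

```python
import numpy as np


def caffe_to_tf_conv_weights(w_caffe):
    """Reorder (out, in, height, width) to (height, width, in, out)."""
    return w_caffe.transpose(2, 3, 1, 0)


# Dummy weights with the same shape as conv1 in the question.
w_caffe = np.arange(96 * 1 * 5 * 5, dtype=np.float32).reshape(96, 1, 5, 5)
w_tf = caffe_to_tf_conv_weights(w_caffe)
print(w_tf.shape)

# The 5x5 kernel for a given (output, input) channel pair is unchanged by the reordering.
o, c = 7, 0
print(np.array_equal(w_tf[:, :, c, o], w_caffe[o, c]))
```

If this holds for the real weights too, the kernel layout is unlikely to be the cause of the mismatch.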
The TensorFlow code is as follows:
gray = tf.reduce_mean(images, axis=3, keep_dims=True)
self.gray = gray
conv1 = self._conv_layer(gray, name='conv1')

def _conv_layer(self, input_, output_dim=96,
                k_h=3, k_w=3, d_h=1, d_w=1, stddev=0.02,
                name=None):  # the rest of the signature was cut off in the post; name is what the caller passes
    # Note: currently kernel size and input/output channel counts are decided by the loaded filter weights.
    # Only the strides are decided by the calling parameters.
    with tf.variable_scope(name) as scope:
        filt = self.get_conv_filter(name)
        conv = tf.nn.conv2d(input_, filt, strides=[1, d_h, d_w, 1], padding='SAME')
        conv_biases = self.get_bias(name)
        return tf.nn.bias_add(conv, conv_biases)

def get_conv_filter(self, name):
    # weights is the transposed Caffe array for this layer
    init = tf.constant_initializer(value=weights, dtype=tf.float32)
    shape = weights.shape
    var = tf.get_variable(name="filter", initializer=init, shape=shape)
    return var
I checked the input data of the Caffe net and TensorFlow's gray tensor: they contain the same numbers in the same 2D layout. The shapes are (1,1,128,128) and (10, 128, 128, 1); TensorFlow uses a batch size of 10.
I also checked the kernel, using Caffe's print(net.blobs['conv1'].data[0,0,...]) and, for the numpy array used to initialize the TensorFlow variable, print(weights[:,:,:,0]).
A screenshot of the kernel's first layer is shown below:
The bias is -0.65039569 and the upper left corner of the image is:
0.30989584 0.30989584 0.29427084 0.21354167 0.16145833
0.30989584 0.30989584 0.29427084 0.21354167 0.16145833
0.28645834 0.28645834 0.27083334 0.19010417 0.09114584
However, the upper left corners of the first conv1 feature map differ between the two (please ignore the irrelevant 256).
Only the leftmost column is consistent. I calculated the results by hand: the first and second values from Caffe (-0.71238005 -0.74042225) are correct according to the definition of convolution, while the second value from TensorFlow (-0.71238005 -0.31195271) is incorrect.
Taking the padding into account, the first value comes from the 3x3 block of the image and the second should come from the 3x4 block.
Since TensorFlow gets the first value right, computed from the 3x3 block at the image corner, I assume the kernel layout, the image layout and the 'SAME' padding are correct. I suspected the stride was causing the incorrect second value, but the stride must be one, otherwise the size of TensorFlow's conv1 feature map would not be (10, 128, 128, 96).
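The hand calculation described above can be written down as a small numpy reference, which may help when comparing against either framework. It is a sketch under a few assumptions: a single channel, stride 1, an odd kernel size with zero padding of kernel_size // 2 on each side, and random data standing in for the real image and kernel. The function name conv2d_same is made up.

```python
import numpy as np


def conv2d_same(image, kernel, bias=0.0):
    """Single-channel, stride-1 cross-correlation with zero padding (no kernel flip)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.empty(image.shape, dtype=np.float64)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel) + bias
    return out


rng = np.random.RandomState(0)
image = rng.rand(8, 8)       # stand-in for the 128x128 gray image
kernel = rng.rand(5, 5)      # stand-in for one 5x5 filter
bias = -0.65039569           # bias value quoted in the question

out = conv2d_same(image, kernel, bias)

# Top-left output: only the lower-right 3x3 part of the kernel overlaps the image.
first = np.sum(image[:3, :3] * kernel[2:, 2:]) + bias
# Next output to the right: a 3x4 block of the image meets the lower-right 3x4 part of the kernel.
second = np.sum(image[:3, :4] * kernel[2:, 1:]) + bias

print(out[0, 0], first)
print(out[0, 1], second)
```

Feeding the real image and the real conv1 filter into such a reference would show which of the two frameworks' second values it agrees with.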
Caffe's convolution layer def:
input_param {
  shape: {
    dim: 10
    dim: 1
    dim: 128
    dim: 128
  }
}
transform_param {
  crop_size: 128
  mirror: false
}
name: "conv1"
type: "Convolution"
param {
  lr_mult: 1
  decay_mult: 1
}
param {
  lr_mult: 2
  decay_mult: 0
}
convolution_param {
  num_output: 96
  kernel_size: 5
  stride: 1
  pad: 2
  weight_filler {
    type: "xavier"
  }
  bias_filler {
    type: "constant"
    value: 0.1
  }
}
bottom: "data"
top: "conv1"
Another contrived experiment (see the code below) shows that the TensorFlow implementation is able to compute the correct second value. However, the error remains in the situation above. What caused the error in the converted version?
import numpy as np
import tensorflow as tf

input = np.random.rand(100,100)
input = input.reshape([1,100,100,1])
k = np.random.rand(5,5)
k = k.reshape([5,5,1,1])
input_tf = tf.constant(input,dtype=tf.float32)
init = tf.constant_initializer(value=k, dtype=tf.float32)
filter = tf.get_variable(name="filter", initializer=init, shape=k.shape)
conv = tf.nn.conv2d(input_tf, filter, strides=[1,1,1,1], padding='SAME')

caffe: 5D blobs pooling?

I have a 5D blob like 1x8x128x128 and a Convolution layer that is able to process it. When I try to use a pooling layer, though, it does not work. How do you use a pooling layer with a 5D blob?
Check failed: 4 == bottom[0]->num_axes() (4 vs. 5) Input must have 4
axes, corresponding to (num, channels, height, width)
I think it is just not supported by Caffe yet. Could I use a convolution layer to do the pooling instead?
If you want to pool only the first two spatial dimensions, you can "Reshape" to 4D ("squashing" the channel and temporal dimensions), pool and then "Reshape" back to 5D:
layer {
  name: "pool/reshape4D"
  type: "Reshape"
  bottom: "in"
  top: "pool/reshape4D"
  reshape_param { axis: 1 num_axes: 2 shape { dim: -1 } } # merge axes 1 and 2 (channel and temporal) into one
}
layer {
  name: "pool"
  type: "Pooling"
  bottom: "pool/reshape4D"
  top: "pool"
  # pooling params here...
}
layer {
  name: "pool/reshape5D"
  type: "Reshape"
  bottom: "pool"
  top: "pool/reshape5D"
  reshape_param { axis: 1 num_axes: 1 shape { dim: -1 dim: <temporal_dim> } } # replace <.> with the actual temporal dimension size.
}
See the definition of ReshapeParameter in caffe.proto for more details.
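The reshape, pool, reshape idea can be checked outside Caffe with a few lines of numpy. This is only a sketch: the blob shape 1x3x8x128x128 (N, C, T, H, W) is a made-up example, the pooling is a non-overlapping 2x2 max pool, and H and W are assumed to be divisible by the kernel size. The function name pool_5d_spatial is hypothetical.

```python
import numpy as np


def pool_5d_spatial(x, k=2):
    """Max-pool the last two axes of an (N, C, T, H, W) array via a 4D detour."""
    n, c, t, h, w = x.shape
    # Step 1: squash channel and temporal axes, (N, C, T, H, W) -> (N, C*T, H, W).
    x4 = x.reshape(n, c * t, h, w)
    # Step 2: non-overlapping k x k max pooling over H and W.
    pooled = x4.reshape(n, c * t, h // k, k, w // k, k).max(axis=(3, 5))
    # Step 3: restore the temporal axis, (N, C*T, H/k, W/k) -> (N, C, T, H/k, W/k).
    return pooled.reshape(n, c, t, h // k, w // k)


x = np.random.rand(1, 3, 8, 128, 128)
y = pool_5d_spatial(x)
print(y.shape)
```

Because pooling acts on each 2D slice independently, merging the channel and temporal axes before pooling and splitting them afterwards leaves every slice's result unchanged.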
