In a Caffe prototxt file, what do the TRAIN and TEST phases do?

I'm new to Caffe, thank you guys!
In https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto
I saw an uncommented enum called Phase. It has two options, TRAIN and TEST:
enum Phase {
  TRAIN = 0;
  TEST = 1;
}
How do they work? I recently saw a model that uses these two phases too. The .prototxt file looks like:
name: "CIFAR10_full"
layer {
name: "cifar"
type: "Data"
top: "data"
top: "label"
data_param {
source: "CIFAR-10/cifar10_train_lmdb"
backend: LMDB
batch_size: 200
}
transform_param {
mirror: true
}
include: { phase: TRAIN }
}
layer {
name: "cifar"
type: "Data"
top: "data"
top: "label"
data_param {
source: "CIFAR-10/cifar10_test_lmdb"
backend: LMDB
batch_size: 100
}
transform_param {
mirror: false
}
include: { phase: TEST }
}
Can I switch from the TRAIN phase to the TEST phase? Where is the switch?

During training (i.e., execution of $CAFFE_ROOT/tools/caffe train [...]) Caffe can alternate between training phases and testing phases: during the training phase the parameters are updated, while in the test phase the parameters are fixed and the model only runs examples feed-forward to estimate its current performance.
It is quite natural to use two different data sets for training and testing, and this is why you use the different phase values.
You can read more about the train/test iterations here.
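The "switch" itself lives in the solver, not in the network prototxt. As a rough sketch (the file name and the numeric values below are illustrative assumptions only), the solver controls how often training pauses to run the TEST-phase network:
# solver.prototxt (illustrative values only)
net: "cifar10_full_train_test.prototxt"  # one file containing both TRAIN and TEST layers
test_iter: 100        # how many TEST-phase batches to run per test pass
test_interval: 500    # run a test pass every 500 training iterations
base_lr: 0.001
lr_policy: "fixed"
max_iter: 60000
snapshot: 5000
snapshot_prefix: "cifar10_full"
solver_mode: GPU
Running caffe train --solver=solver.prototxt then updates the weights on TRAIN-phase batches and, every test_interval iterations, keeps them fixed while it runs test_iter TEST-phase batches.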

TRAIN specifies a layer for the model used during training.
TEST specifies a layer for the model used during testing.
Thus, you can define two models in a single prototxt file: one model for training and one model for testing.
Info on this can be found in the Model Definition section of the web page http://caffe.berkeleyvision.org/gathered/examples/imagenet.html

Related

Interpret Caffe FCN output classes

I have set up Caffe and I am using the FCN-8s model with a small change in the number of output classes:
layer {
  name: "score_5classes"
  type: "Convolution"
  bottom: "score"
  top: "score_5classes"
  convolution_param {
    num_output: 2
    pad: 0
    kernel_size: 1
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score_5classes"
  bottom: "label"
  top: "loss"
  loss_param {
    normalize: true
  }
}
I have changed the last layer's output number to 2, because I want to classify my input images into 2 classes, 0 and 1 (so it seems I should have 2 outputs, although I can't understand why; couldn't it just be an output matrix of zeros and ones?).
So my questions are:
1. Should I sum these 2 classes, since I need only 1 output?
2. The loss is so small, even when the output is far away from the desired one! How does Caffe calculate the loss?
Thanks
When doing binary classification, using "SoftmaxWithLoss" with two outputs is mathematically equivalent to using "SigmoidCrossEntropyLoss". So, if you really only need one output, you can set your last layer to num_output: 1 and use "SigmoidCrossEntropyLoss". However, if you want to take advantage of caffe's "Accuracy" layer, you need two outputs and a "SoftmaxWithLoss" layer.
Regarding your questions:
1. If you opt to use "SoftmaxWithLoss" and you only need one output, take the second output for each pixel as this entry represents the probability of class 1.
I'll leave it to you as an exercise to figure out what you'll get if you take the sum (hint: "Softmax" outputs probabilities...).
2. The loss is very small most likely because you have a severe class imbalance: most of your pixels are 0 while only very few are 1 (or vice versa), so always predicting 0 does not incur a large penalty. If this is your case, I suggest looking at Focal Loss, which addresses this issue.
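For reference, a rough sketch of the single-output alternative mentioned above (the score_1class layer name is only an illustrative assumption):
layer {
  name: "score_1class"
  type: "Convolution"
  bottom: "score"
  top: "score_1class"
  convolution_param {
    num_output: 1   # a single per-pixel score for class 1
    pad: 0
    kernel_size: 1
  }
}
layer {
  name: "loss"
  type: "SigmoidCrossEntropyLoss"   # replaces SoftmaxWithLoss in the one-output case
  bottom: "score_1class"
  bottom: "label"
  top: "loss"
}
At test time you would pass score_1class through a "Sigmoid" layer to get the probability of class 1.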

forward network in CNN

I am new to deep learning. I found that there are two prototxt files when using Caffe: one is "deploy" and the other is "train_val".
I know that "train_val" is used to train the model, but some people said the "deploy" file is for testing images.
So my question is: does the "deploy" network only run forward(), so that the test image data goes through the network just once, feed-forward, to get the score?
As you already noted, there are some fundamental differences between 'train_val.prototxt' and 'deploy.prototxt'.
One key difference is that 'deploy.prototxt' usually lacks any loss layer.
When you have no loss function defined for a net, there is no meaning of backward propagation: what gradients would you propagate? gradients of what function?
That said, a net object in Caffe has the backward() method implemented for all phases. Nevertheless, this method is meaningless when you test the net with no loss function (only prediction).
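For illustration, here is a minimal deploy-style sketch (the input shape and the fc8 blob name are assumptions, not taken from any specific model): the Data and loss layers are replaced by an Input layer and a plain probability output.
name: "MyModelDeploy"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }  # a single RGB image
}
# ... the same Convolution / InnerProduct layers as in train_val.prototxt ...
layer {
  name: "prob"
  type: "Softmax"   # plain probabilities; no label blob, no loss
  bottom: "fc8"
  top: "prob"
}
Such a net can only be run forward; there is no label blob and nothing to back-propagate.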
Ideally that is how it should work, but the files are just network definitions. You can use one single file to both train and test. You have to specify in which phase you want certain blobs to be available, meaning you can define two input Data layers, one used during training and another used for testing, and specify the corresponding phase like this:
name: "MyModel"
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: false
crop_size: 227
mean_file: "data/train_mean.binaryproto" # location of the training data mean
}
data_param {
source: "data/train_lmdb" # location of the training samples
batch_size: 128 # how many samples are grouped into one mini-batch
backend: LMDB
}
}
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
During training, the first layer will be used and the second will be ignored.
During the test phase, the first layer will be ignored and the second layer will be used as the input for testing.
Another point is that during testing we need the accuracy of our predictions, since we don't update our weights anymore:
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc8"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}
If the include directive is not given, the layer is included in all phases.
You can also include the accuracy layer during training to see how the outputs are evolving (e.g., to measure the accuracy improvement after some number of iterations), but it is mostly needed for predictions.
In your solver you can specify test_interval to control after how many training iterations the test pass is carried out, and test_iter to control how many test batches are run each time (you validate your model every test_interval iterations).
The train_val and deploy files separate those two phases into two different files: all specifications in train_val relate to the training phase, and deploy is for testing. I am not sure where the train_val combination came from, but I suppose it is due to the fact that you can validate your model every test_interval iterations and then continue training from there.
As you don't need the loss during testing but rather the probabilities, you can use a Softmax layer to output probabilities instead of SoftmaxWithLoss in deploy, or you can have both defined.
The caffe test command performs the forward operation but doesn't do the backward() (back-propagation) operation. I hope it helps.

How to compute the validation loss from a .caffemodel?

AFAIK, we have two ways to obtain the validation loss.
(1) Online, during the training process, by setting the solver as follows:
train_net: 'train.prototxt'
test_net: "test.prototxt"
test_iter: 200
test_interval: 100
(2) Offline, based on the weights in the .caffemodel file. In this question I am concerned with the second way, due to limited GPU. First, I saved the weights of the network to a .caffemodel after every 100 iterations by setting snapshot: 100. Based on these .caffemodel files, I want to calculate the validation loss:
../build/tools/caffe test -model ./test.prototxt -weights $snapshot -iterations 10 -gpu 0
where $snapshot is the file name of the .caffemodel, for example snap_network_100.caffemodel.
And the data layer of my test prototxt is
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  hdf5_data_param {
    source: "./list.txt"
    batch_size: 8
    shuffle: true
  }
}
The first and second ways give different validation losses. I found that with the first way the validation loss is independent of batch size: it is the same for different batch sizes. With the second way, the validation loss changes with different batch sizes, but the losses are very close together across different iterations.
My question is: which way is the correct one to compute the validation loss?
You compute the validation loss over a different number of iterations:
test_iter: 200
in your 'solver.prototxt', vs. -iterations 10 when running from the command line. This means you are averaging the loss over a different number of validation samples.
Since you are using far fewer samples when validating from the command line, you are much more sensitive to batch_size.
Make sure you are using exactly the same settings and verify that the validation loss is indeed the same.
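For example, assuming the same test.prototxt (with batch_size: 8) in both cases, matching the command-line run to the solver settings would look like:
../build/tools/caffe test -model ./test.prototxt -weights snap_network_100.caffemodel -iterations 200 -gpu 0
Both runs then average the loss over 200 x 8 = 1600 validation samples.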

NVIDIA DIGITS - Multiple Input layers

I have a question regarding the NVIDIA DIGITS framework.
So far I have been using Caffe without DIGITS and have used HDF5 layers. There I could use multiple "top" inputs (data_0, data_1, data_2; see code below), so I could feed the net with more than one input image. But in DIGITS, only the LMDB input layer works.
So is it possible to create an LMDB input layer with multiple input images?
layer {
  name: "data"
  type: "HDF5Data"
  top: "data_0"
  top: "data_1"
  top: "data_2"
  top: "label"
  hdf5_data_param {
    source: "train.txt"
    batch_size: 64
    shuffle: true
  }
}
Sorry, that isn't supported in DIGITS.
Since DIGITS manages your datasets for you, it also sets up the data layers in your network for you. This way you don't need to copy+paste the LMDB paths into your network when you want to run a previous network on a new dataset or when you move the location of your jobs on disk. It's a decision which sacrifices flexibility for the sake of making the common case easy.
For classification, one LMDB should have two tops: "data" and "label". For other dataset types, there should be one LMDB with a single "data" top, and another LMDB with a single "label" top. If you need a more complicated data layer setup, then you'll need to either use Caffe directly or make some changes to the DIGITS source code.
DIGITS's HDF5 support is not great because Caffe's HDF5 support is not great.

Testing a regression network in caffe

I am trying to count objects in an image using AlexNet.
I currently have images containing 1, 2, 3 or 4 objects per image. For an initial check, I have 10 images per class. For example, in the training set I have:
image label
image1 1
image2 1
image3 1
...
image39 4
image40 4
I used the ImageNet create script to create an LMDB file for this dataset, which successfully converted my set of images to LMDB.
AlexNet is converted into a regression model for learning the number of objects in the image by introducing a EuclideanLoss layer instead of the Softmax layer, as suggested by many. The rest of the network is the same.
However, despite doing all of the above, when I run the model I receive only zeros as output during the testing phase (shown below). It did not learn anything, even though the training loss decreased continuously in each iteration.
I don't understand what mistakes I have made. Can anybody explain why the predicted values are always 0? And how can I check the regressed values in the testing phase, so that I can see how many samples are correct and what the value is for each of my images?
The predicted and actual labels on the test dataset are given as:
I0928 17:52:45.585160 18302 solver.cpp:243] Iteration 1880, loss = 0.60498
I0928 17:52:45.585212 18302 solver.cpp:259] Train net output #0: loss = 0.60498 (* 1 = 0.60498 loss)
I0928 17:52:45.585225 18302 solver.cpp:592] Iteration 1880, lr = 1e-06
I0928 17:52:48.397922 18302 solver.cpp:347] Iteration 1900, Testing net (#0)
I0928 17:52:48.499543 18302 accuracy_layer.cpp:88] Predicted_Value: 0 Actual Label: 1
I0928 17:52:48.499641 18302 accuracy_layer.cpp:88] Predicted_Value: 0 Actual Label: 2
I0928 17:52:48.499660 18302 accuracy_layer.cpp:88] Predicted_Value: 0 Actual Label: 3
I0928 17:52:48.499681 18302 accuracy_layer.cpp:88] Predicted_Value: 0 Actual Label: 4
...
Note: I also created HDF5-format files in order to have floating-point labels, i.e. 1.0, 2.0, 3.0 and 4.0. However, when I changed the data layer to the HDF5 type, I could not crop the images for data augmentation (as is done in AlexNet with the LMDB layer), nor apply normalization. I used the script given at "https://github.com/nikogamulin/caffe-utils/blob/master/hdf5/demo.m" for the HDF5 data and followed its steps for using it in my model.
I have updated the last layers as follows:
layer {
  name: "fc8reg"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8reg"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc8reg"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "fc8reg"
  bottom: "label"
  top: "loss"
}
Without judging whether your network diverged or not, the obvious mistake you have made is that you shouldn't use an Accuracy layer to test a regression network. It is only for testing a classification network trained with a SoftmaxWithLoss layer.
In fact, given an input image, the Accuracy layer always sorts its input array (here it is bottom: "fc8reg") and, by default, chooses the index of the maximal value in the array as the predicted label.
Since num_output == 1 in the fc8reg layer, the Accuracy layer will always output index 0 as the predicted label for the input image, as you have seen.
Instead, you can use a EuclideanLoss layer to test your regression network. This similar problem may also give you a hint.
If you want to print the regressed values after training and measure the accuracy of the regression network, you can simply write a RegressionAccuracy layer like this.
Or, if your target label only has 4 discrete values {1,2,3,4}, you can still train a classification network for your task.
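As a minimal sketch of that suggestion (reusing the blob names from your prototxt), the TEST-phase Accuracy layer can simply be replaced by another EuclideanLoss layer:
layer {
  name: "test_loss"
  type: "EuclideanLoss"   # reports the regression loss in the TEST phase instead of a classification accuracy
  bottom: "fc8reg"
  bottom: "label"
  top: "test_loss"
  include {
    phase: TEST
  }
}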
In my opinion, everything is correct, but your network is not converging, which is not a rare occurrence. Your network is actually converging to zero outputs!
Maybe most of your samples have 0 as their label.
Also, don't forget to include the loss layer only during TRAIN; otherwise, it will learn on the test data as well.
