I am attempting to implement a Caffe Softmax layer with a "temperature" parameter. I am implementing a network utilizing the distillation technique outlined here.
Essentially, I would like my Softmax layer to utilize the Softmax w/ temperature function as follows:
F(X)_i = exp(z_i(X) / T) / sum_l exp(z_l(X) / T)
Using this, I want to be able to tweak the temperature T before training. I have found a similar question, but that question is about implementing Softmax with temperature on the deploy network. I am struggling to implement the additional Scale layer described as "option 4" in its first answer.
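For clarity, here is a small numpy sketch (not Caffe code) of the behavior I am after; the logits are made up:

import numpy as np

def softmax_with_temperature(z, T):
    # F(X)_i = exp(z_i / T) / sum_l exp(z_l / T)
    z = np.asarray(z, dtype=np.float64) / T
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 5.0, 1.0, 2.0])          # made-up logits
print(softmax_with_temperature(logits, T=1))     # peaked distribution
print(softmax_with_temperature(logits, T=40))    # much flatter ("softened") distribution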
I am using the cifar10_full_train_test prototxt file found in Caffe's examples directory. I have tried making the following change:
Original
...
...
...
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip1"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip1"
bottom: "label"
top: "loss"
}
Modified
...
...
...
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip1"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
type: "Scale"
name: "temperature"
top: "zi/T"
bottom: "ip1"
scale_param {
filler: { type: 'constant' value: 0.025 } ### I wanted T = 40, so 1/40=.025
}
param { lr_mult: 0 decay_mult: 0 }
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip1"
bottom: "label"
top: "loss"
}
After a quick training run (5,000 iterations), I checked whether my classification probabilities appeared more even, but they actually appeared to be less evenly distributed.
Example:
high temp T: F(X) = [0.2, 0.5, 0.1, 0.2]
low temp T: F(X) = [0.02, 0.95, 0.01, 0.02]
~my attempt: F(X) = [0, 1.0, 0, 0]
Do I appear to be on the right track with this implementation? Either way, what am I missing?
You are not using the "cooled" predictions "zi/T" that your "Scale" layer produces.
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "zi/T" # Use the "cooled" predictions instead of the originals.
bottom: "label"
top: "loss"
}
The accepted answer has helped me to understand my misconceptions regarding the Softmax temperature implementation.
As @Shai pointed out, in order to observe the "cooled" probability outputs as I was expecting, the Scale layer must be added only to the "deploy" prototxt file. It is not necessary to include the Scale layer in the train/val prototxt at all. In other words, the temperature must be applied to the Softmax layer, not the SoftmaxWithLoss layer.
If you want to apply the "cooled" effect to your probability vector, simply make sure your last two layers are as such:
deploy.prototxt
layer {
type: "Scale"
name: "temperature"
top: "zi/T"
bottom: "ip1"
scale_param {
filler: { type: 'constant' value: 1/T } ## Replace "1/T" with actual 1/T value
}
param { lr_mult: 0 decay_mult: 0 }
}
layer {
name: "prob"
type: "Softmax"
bottom: "zi/T"
top: "prob"
}
My confusion was due primarily to my misunderstanding of the difference between SoftmaxWithLoss and Softmax.
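For anyone who wants to sanity-check the deploy net, here is a rough pycaffe sketch for inspecting the "cooled" probabilities; the file names are placeholders, "prob" matches the snippet above, and "data" is assumed to be the input blob:

import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'snapshot.caffemodel', caffe.TEST)  # placeholder paths
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)  # dummy input
net.forward()
print(net.blobs['prob'].data[0])  # with T > 1 this should be noticeably flatter than the raw softmax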
Related
I am new to Caffe and I want to use the already trained CaffeNet model with ImageNet. I applied net surgery by removing an intermediate convolutional layer, conv4.
layer {
name: "relu3"
type: "ReLU"
bottom: "conv3"
top: "conv3"
}
layer {
name: "relu5-new"
type: "ReLU"
bottom: "conv5-new"
top: "conv5-new"
}
layer {
name: "pool5-new"
type: "Pooling"
bottom: "conv5-new"
top: "pool5-new"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layer {
name: "fc6"
type: "InnerProduct"
bottom: "pool5-new"
top: "fc6"
inner_product_param {
num_output: 4096
}
}
The full prototxt file can be found here.
After saving this new network, the accuracy became 0. Should I fine-tune on the ImageNet validation set, or is there something wrong with my new prototxt file?
Any help will be appreciated!
The original net you started with had conv4 between conv3 and conv5: this means the filters (weights) of conv5 were expecting a certain number of inputs and a certain "order" or "meaning" of those inputs. Once you removed conv4, you had to alter conv5 to accept a different number of inputs. Therefore, the new conv5 layer must be trained to accommodate the new inputs it receives.
In this case, when you introduce a new conv5 layer, you should have a weight_filler defined in your prototxt to tell Caffe how to initialize the new weights. Otherwise Caffe will set the weights to zero, and it will be almost impossible to finetune in that case.
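Alternatively, you can initialize the new layer by hand with pycaffe before fine-tuning. A rough sketch, assuming the new layer is named "conv5-new" as in your prototxt; the paths and the 0.01 Gaussian std are placeholders:

import numpy as np
import caffe

# load the edited prototxt together with the original weights;
# layers that kept their names are copied, while "conv5-new" stays at its default (zero) init
net = caffe.Net('path/to/new_train_val.prototxt', 'path/to/original.caffemodel', caffe.TRAIN)
w = net.params['conv5-new'][0]                       # weight blob of the new layer
w.data[...] = 0.01 * np.random.randn(*w.data.shape)  # Gaussian init instead of zeros
net.params['conv5-new'][1].data[...] = 0.0           # zero the biases
net.save('path/to/init_weights.caffemodel')          # fine-tune starting from this file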
I'm trying to reproduce the following paper with Caffe:
Deep EXpectation
The last layer has 100 outputs, and each output represents the probability of the corresponding predicted age. The final predicted age is calculated as the expected value over those probabilities, i.e. the sum over i of i times the predicted probability of age i.
So I want to build the loss using EUCLIDEAN_LOSS between the label and this predicted value.
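In other words, a tiny numpy illustration of the expectation I mean (the probability vector here is made up):

import numpy as np

prob = np.random.dirichlet(np.ones(100))   # stand-in for the softmax output over 100 age bins
ages = np.arange(100, dtype=np.float32)    # the discrete ages 0..99
predicted_age = np.sum(ages * prob)        # expected value: sum_i i * p_i
print(predicted_age)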
Here is my prototxt for the last output layer and the loss layer:
layer {
bottom: "pool5"
top: "fc100"
name: "fc100"
type: "InnerProduct"
inner_product_param {
num_output: 100
}
}
layer {
bottom: "fc100"
top: "prob"
name: "prob"
type: "Softmax"
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "fc100"
bottom: "label"
top: "loss"
loss_weight: 1
}
For now, I am trying this with SoftmaxWithLoss. However, that loss is more appropriate for classification than for regression. How can I design the loss layer in this case?
Thanks in advance.
TL;DR
I've been through a similar task once, and from my experience there was little difference (in terms of output accuracy) between training on discrete labels and regressing a single continuous value.
There are several ways you can approach this problem:
1. Regressing a single output
Since you only need to predict a single scalar value, you should train your net to do just so:
layer {
bottom: "pool5"
top: "fc1"
name: "fc1"
type: "InnerProduct"
inner_product_param {
num_output: 1 # predict single output
}
}
You need to make sure the predicted value is in range [0..99]:
layer {
bottom: "fc1"
top: "pred01" # map to [0..1] range
type: "Sigmoid"
name: "pred01"
}
layer {
bottom: "pred01"
top: "pred_age"
type: "Scale"
name: "pred_age"
param { lr_mult: 0 } # do not learn this scale - it is fixed
scale_param {
bias_term: false
filler { type: "constant" value: 99 }
}
}
Once you have the prediction in pred_age you can add a loss layer
layer {
bottom: "pred_age"
bottom: "true_age"
top: "loss"
type: "EuclideanLoss"
name: "loss"
}
Though, I would advise using "SmoothL1" loss in this case, as it is more robust.
2. Regressing the expectation of the discrete prediction
You can implement your prediction formula in Caffe. You need a fixed vector of the values [0..99] for that. There are many ways to do that, none of them very straightforward. Here's one way, using net surgery:
First, define the net
layer {
bottom: "prob"
top: "pred_age"
name: "pred_age"
type: "Convolution"
param { lr_mult: 0 } # fixed layer.
convolution_param {
num_output: 1
bias_term: false
}
}
layer {
bottom: "pred_age"
bottom: "true_age"
top: "loss"
type: "EuclideanLoss" # same comment about type of loss as before
name: "loss"
}
You cannot use this net yet; first you need to set the kernel of the pred_age layer to the values 0..99.
In Python, load the new net, set the kernel, and save the initialized weights:
import caffe
import numpy as np

net = caffe.Net('path/to/train_val.prototxt', caffe.TRAIN)
li = list(net._layer_names).index('pred_age')  # get layer index
net.layers[li].blobs[0].data[...] = np.arange(100, dtype=np.float32)  # set the kernel to 0..99
net.save('/path/to/init_weights.caffemodel')  # save the weights
Now you can train your net, but MAKE SURE you start training from the weights saved in '/path/to/init_weights.caffemodel' (e.g., pass them to the caffe train tool with the --weights flag).
I have 2 different models, let's say NM1 and NM2.
So, what I'm looking for is something that works like the example below.
Let's say that we have a picture of a dog.
NM1 predicts that it's a cat on the picture with a probability 0.52 and that it's a dog with a probability 0.48.
NM2 predicts that it's a dog with a probability 0.6 and that it's a cat with a probability 0.4.
NM1 - will predict wrong
NM2 - will predict correctly
NM1 + NM2 - connection will predict correctly (because 0.48 + 0.6 > 0.52 + 0.4)
So, each model ends with InnerProducts (after Softmax) which give me 2 vectors of probabilities.
Next step, I have those 2 vectors and I want to add them. Here I use Eltwise layer.
layer {
name: "eltwise-sum"
type: "Eltwise"
bottom: "fc8"
bottom: "fc8N"
top: "out"
eltwise_param { operation: SUM }
}
Before joining, NM1 had an accuracy of ~70% and NM2 ~10%.
After joining, the accuracy can't even reach 1%.
Thus, my conclusion is that I have misunderstood something, and I'd be grateful if someone could explain to me where I'm wrong.
PS. I did turn off shuffle when creating lmdb.
UPDATE
layer {
name: "eltwise-sum"
type: "Eltwise"
bottom: "fc8L"
bottom: "fc8NL"
top: "out"
eltwise_param {
operation: SUM
coeff: 0.5
coeff: 0.5
}
}
#accur for PI alone
layer {
name: "accuracyPINorm"
type: "Accuracy"
bottom: "fc8L"
bottom: "label"
top: "accuracyPiNorm"
include {
phase: TEST
}
}
#accur for norm images alone
layer {
name: "accuracyIMGNorm"
type: "Accuracy"
bottom: "fc8NL"
bottom: "labelN"
top: "accuracyIMGNorm"
include {
phase: TEST
}
}
#accur for them together
layer {
name: "accuracy"
type: "Accuracy"
bottom: "out"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
If you want to add the probabilities element-wise, you need to add them after the "Softmax" layer, not after the "InnerProduct" layer. You should have something like:
layer {
type: "InnerProduct"
name: "fc8"
top: "fc8"
# ...
}
layer {
type: "Softmax"
name: "prob_nm1"
top: "prob_nm1"
bottom: "fc8"
}
layer {
type: "InnerProduct"
name: "fc8N"
top: "fc8N"
# ...
}
layer {
type: "Softmax"
name: "prob_nm2"
top: "prob_nm2"
bottom: "fc8N"
}
# Joining the probabilities
layer {
type: "Eltwise"
name: "prob_sum"
bottom: "prob_nm1"
bottom: "prob_nm2"
top: "prob_sum"
eltwise_param {
operation: SUM
coeff: 0.5
coeff: 0.5
}
}
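For intuition, plugging the question's own numbers into numpy shows why summing after "Softmax" gives the correct answer:

import numpy as np

# class order: [cat, dog]
p_nm1 = np.array([0.52, 0.48])        # NM1: cat slightly more likely (wrong)
p_nm2 = np.array([0.40, 0.60])        # NM2: dog more likely (right)
p_sum = 0.5 * p_nm1 + 0.5 * p_nm2     # what the "Eltwise" SUM with coeff 0.5 computes
print(p_sum, p_sum.argmax())          # [0.46 0.54] -> index 1 (dog), the correct class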
I have a network which has 4 Boolean outputs. It is not a classification problem, and each of the outputs is meaningful. I expect to get a zero or one for each of them. Right now I am using the Euclidean loss function.
There are 1,000,000 samples. In the input file, each of them has 144 features, so the size of the input is 1000000*144.
I have used a batch size of 50, because otherwise the processing time is too long.
The output file is of size 1000000*4, i.e. there are four outputs per input.
When I use the accuracy layer, it complains about the dimension of the output: it expects just one label per sample, not four. I think this is because it treats the problem as a classification problem.
I have two questions.
First, considering the error from the accuracy layer, is the Euclidean loss function suitable for this task? And how can I get the accuracy of my network?
Second, I want to get the exact value of the predicted output for each of the four variables. I mean, I need the exact predicted values for each test record; right now I just have the loss value for each batch.
Please guide me to solve those issues.
Thanks,
Afshin
The train network is:
state {
phase: TRAIN
}
layer {
name: "abbas"
type: "HDF5Data"
top: "data"
top: "label"
hdf5_data_param {
source: "/home/afo214/Research/hdf5/simulation/Train-1000-11- 1/Train-Sc-B-1000-11-1.txt"
batch_size: 50
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "data"
top: "ip1"
inner_product_param {
num_output: 350
weight_filler {
type: "xavier"
}
}
}
layer {
name: "sig1"
bottom: "ip1"
top: "sig1"
type: "Sigmoid"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "sig1"
top: "ip2"
inner_product_param {
num_output: 150
weight_filler {
type: "xavier"
}
}
}
The test network is also:
state {
phase: TEST
}
layer {
name: "abbas"
type: "HDF5Data"
top: "data"
top: "label"
hdf5_data_param {
source: "/home/afo214/Research/hdf5/simulation/Train-1000-11- 1/Train-Sc-B-1000-11-1.txt"
batch_size: 50
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "data"
top: "ip1"
inner_product_param {
num_output: 350
weight_filler {
type: "xavier"
}
}
}
layer {
name: "sig1"
bottom: "ip1"
top: "sig1"
type: "Sigmoid"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "sig1"
top: "ip2"
inner_product_param {
num_output: 150
weight_filler {
type: "xavier"
}
}
}
layer {
name: "sig2"
bottom: "ip2"
top: "sig2"
type: "Sigmoid"
}
layer {
name: "ip4"
type: "InnerProduct"
bottom: "sig2"
top: "ip4"
inner_product_param {
num_output: 4
weight_filler {
type: "xavier"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip4"
bottom: "label"
top: "accuracy"
}
layer {
name: "loss"
type: "EuclideanLoss"
bottom: "ip4"
bottom: "label"
top: "loss"
}
And I get this error:
accuracy_layer.cpp:34] Check failed: outer_num_ * inner_num_ == bottom[1]->count() (50 vs. 200) Number of labels must match number of predictions; e.g., if label axis == 1 and prediction shape is (N, C, H, W), label count (number of labels) must be N*H*W, with integer values in {0, 1, ..., C-1}.
Without using the accuracy layer caffe gives me the loss value.
Should "EuclideanLoss" be used for predicting binary outputs?
If you are trying to predict discrete binary labels then "EuclideanLoss" is not a very good choice. This loss is better suited for regression tasks where you wish to predict continuous values (e.g., estimating coordinates of bounding boxes, etc.).
For predicting discrete labels, "SoftmaxWithLoss" or "InfogainLoss" are better suited. Usually, "SoftmaxWithLoss" is used.
For predicting binary outputs you may also consider "SigmoidCrossEntropyLoss".
Why is there an error in the "Accuracy" layer?
In caffe, "Accuracy" layers expects two inputs ("bottom"s): one is a prediction vector and the other is the ground truth expected discrete label.
In your case, you need to provide, for each binary output a vector of length 2 with the predicted probabilities of 0 and 1, and a single binary label:
layer {
name: "acc01"
type: "Accuracy"
bottom: "predict01"
bottom: "label01"
top: "acc01"
}
In this example you measure the accuracy for a single binary output. The input "predict01" is a two-vector for each example in the batch (for batch_size: 50 the shape of this blob should be 50-by-2).
What can you do?
You are trying to predict 4 different outputs in a single net, therefore, you need 4 different loss and accuracy layers.
First, you need to split ("Slice") the ground truth labels into 4 scalars (instead of a single binary 4-vector):
layer {
name: "label_split"
bottom: "label" # name of input 4-vector
top: "label01"
top: "label02"
top: "label03"
top: "label04"
type: "Slice"
slice_param {
axis: 1
slice_point: 1
slice_point: 2
slice_point: 3
}
}
Now you have to have a prediction, loss and accuracy layer for each of the binary labels
layer {
name: "predict01"
type: "InnerProduct"
bottom: "sig2"
top: "predict01"
inner_product_param {
num_output: 2 # because you need to predict 2 probabilities, one for False, one for True
...
}
}
layer {
name: "loss01"
type: "SoftmaxWithLoss"
bottom: "predict01"
bottom: "label01"
top: "loss01"
}
layer {
name: "acc01"
type: "Accuracy"
bottom: "predict01"
bottom: "label01"
top: "acc01"
}
Now you need to replicate these three layers for each of the four binary labels you wish to predict.
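Regarding your second question (getting the exact predicted values for each test record), you can read the blobs directly with pycaffe after a forward pass. A rough sketch, assuming the blob names "predict01".."predict04" from the scheme above and placeholder file names:

import numpy as np
import caffe

net = caffe.Net('test.prototxt', 'trained.caffemodel', caffe.TEST)   # placeholder paths
net.forward()                                                        # processes one batch of 50 records
for name in ['predict01', 'predict02', 'predict03', 'predict04']:
    scores = net.blobs[name].data                                    # shape (50, 2): scores for False/True
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over the two classes
    print(name, probs[:, 1])                                         # predicted probability of "True" per record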
I am extracting 30 facial keypoint values (x, y) from an input image, as per the Kaggle facial keypoints competition.
How do I set up Caffe to run a regression and produce a 30-dimensional output?
Input: 96x96 image
Output: 30 values (30 dimensions)
How do I set up Caffe accordingly? I am using EUCLIDEAN_LOSS (sum of squares) to get the regressed output. Here is a simple logistic regressor model using Caffe, but it is not working. It looks like the accuracy layer cannot handle multi-label output.
I0120 17:51:27.039113 4113 net.cpp:394] accuracy <- label_fkp_1_split_1
I0120 17:51:27.039135 4113 net.cpp:356] accuracy -> accuracy
I0120 17:51:27.039158 4113 net.cpp:96] Setting up accuracy
F0120 17:51:27.039201 4113 accuracy_layer.cpp:26] Check failed: bottom[1]->channels() == 1 (30 vs. 1)
*** Check failure stack trace: ***
# 0x7f7c2711bdaa (unknown)
# 0x7f7c2711bce4 (unknown)
# 0x7f7c2711b6e6 (unknown)
Here is the layer file:
name: "LogReg"
layers {
name: "fkp"
top: "data"
top: "label"
type: HDF5_DATA
hdf5_data_param {
source: "train.txt"
batch_size: 100
}
include: { phase: TRAIN }
}
layers {
name: "fkp"
type: HDF5_DATA
top: "data"
top: "label"
hdf5_data_param {
source: "test.txt"
batch_size: 100
}
include: { phase: TEST }
}
layers {
name: "ip"
type: INNER_PRODUCT
bottom: "data"
top: "ip"
inner_product_param {
num_output: 30
}
}
layers {
name: "loss"
type: EUCLIDEAN_LOSS
bottom: "ip"
bottom: "label"
top: "loss"
}
layers {
name: "accuracy"
type: ACCURACY
bottom: "ip"
bottom: "label"
top: "accuracy"
include: { phase: TEST }
}
I found it :)
I replaced the softmax loss layer with the EUCLIDEAN_LOSS function and changed the number of outputs. It worked.
layers {
name: "loss"
type: EUCLIDEAN_LOSS
bottom: "ip1"
bottom: "label"
top: "loss"
}
HINGE_LOSS is another option.
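Since EUCLIDEAN_LOSS needs real-valued 30-dimensional labels, the HDF5 files listed in train.txt have to carry them directly. A rough h5py sketch of how such a file could be written (the arrays are random placeholders):

import h5py
import numpy as np

N = 100                                               # placeholder number of samples
X = np.random.rand(N, 1, 96, 96).astype(np.float32)  # grayscale 96x96 input images
y = np.random.rand(N, 30).astype(np.float32)         # 30 regression targets per image

with h5py.File('train_fkp.h5', 'w') as f:
    f.create_dataset('data', data=X)    # matches the "data" top of the HDF5_DATA layer
    f.create_dataset('label', data=y)   # matches the "label" top

with open('train.txt', 'w') as f:       # the HDF5_DATA "source" lists the .h5 file paths
    f.write('train_fkp.h5\n')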