I am using Caffe to train AlexNet on a known image database. I am benchmarking and want to exclude a testing phase.
Here is the solver.prototxt for AlexNet:
net: "models/bvlc_alexnet/train_val.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 450000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "models/bvlc_alexnet/caffe_alexnet_train"
solver_mode: GPU
While I have never found a definitive document detailing all of the prototxt options, comments within the Caffe tutorials indicate that test_interval is the number of training iterations between tests of the trained network.
I figured that I might set it to zero to turn off testing. Nope.
F1124 14:42:54.691428 18772 solver.cpp:140] Check failed: param_.test_interval() > 0 (0 vs. 0)
*** Check failure stack trace: ***
So I set test_interval to one million, but of course Caffe still tests the network at iteration zero.
I1124 14:59:12.787899 18905 solver.cpp:340] Iteration 0, Testing net (#0)
I1124 14:59:15.698724 18905 solver.cpp:408] Test net output #0: accuracy = 0.003
How do I turn testing off while training?
Caffe's documentation is somewhat scant on details. What I was finally told is this counterintuitive solution:
In your solver.prototxt, take the lines for test_iter and test_interval
test_iter: 1000
test_interval: 1000
and simply omit them. If you'd like to prevent the test at the beginning, add the line @Shai suggested:
test_initialization: false
You have a flag for that too. Add
test_initialization: false
to your 'solver.prototxt' and you are done ;)
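For reference, if you'd rather strip those fields programmatically than edit the file by hand, a minimal pycaffe-style sketch could look like this. It assumes the standard BVLC layout used in the question; the output file name solver_notest.prototxt is just a placeholder.
from caffe.proto import caffe_pb2
from google.protobuf import text_format

# Load the existing solver definition.
solver = caffe_pb2.SolverParameter()
with open('models/bvlc_alexnet/solver.prototxt') as f:
    text_format.Merge(f.read(), solver)

# Drop the test settings and skip the test at iteration 0.
solver.ClearField('test_iter')
solver.ClearField('test_interval')
solver.test_initialization = False

# Write the modified solver to a new, placeholder-named file.
with open('models/bvlc_alexnet/solver_notest.prototxt', 'w') as f:
    f.write(text_format.MessageToString(solver))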
I have a Caffe prototxt as follows:
stepsize: 20000
iter_size: 4
batch_size: 10
gamma: 0.1
in which the dataset has 40,000 images. It means that after 20,000 iterations, the learning rate will decrease by a factor of 10. In PyTorch, I want to compute the number of epochs needed to get the same behavior as in Caffe (for the learning rate). After how many epochs should I decrease the learning rate by a factor of 10 (note that we have iter_size=4 and batch_size=10)? Thanks
Ref: Epoch vs Iteration when training neural networks
My answer: Example: if you have 40,000 training examples and the batch size is 10, then it will take 40000/10 = 4000 iterations to complete 1 epoch. Hence, 20,000 iterations to reduce the learning rate in Caffe will be the same as 5 epochs in PyTorch.
You did not take into account iter_size: 4: when the batch is too large to fit into memory, you can "split" it into several iterations.
In your example, the actual batch size is batch_size * iter_size = 10 * 4 = 40. Therefore, an epoch takes only 1,000 iterations, and you need to decrease the learning rate after 20 epochs.
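For the PyTorch side, here is a minimal sketch of the matching schedule. The model and base learning rate are placeholders (the question does not give them); the arithmetic simply restates the answer above.
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

# Caffe settings from the question.
stepsize = 20000          # solver iterations between lr drops
iter_size = 4             # gradient-accumulation steps per solver iteration
batch_size = 10           # images per forward pass
num_train_images = 40000

effective_batch = batch_size * iter_size               # 40 images per solver iteration
iters_per_epoch = num_train_images // effective_batch  # 1000
step_epochs = stepsize // iters_per_epoch               # 20

# Placeholder model, just so the optimizer has parameters to manage.
model = nn.Linear(10, 5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Multiply the lr by 0.1 every 20 epochs, matching gamma: 0.1 and stepsize: 20000.
scheduler = StepLR(optimizer, step_size=step_epochs, gamma=0.1)

for epoch in range(40):
    # ... run one full training epoch here, calling optimizer.step() per batch ...
    scheduler.step()  # advance the schedule once per epoch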
AFAIK, we have two ways to obtain the validation loss.
(1) online, during the training process, by setting the solver as follows:
train_net: 'train.prototxt'
test_net: "test.prototxt"
test_iter: 200
test_interval: 100
(2) offline, based on the weights in the .caffemodel file. In this question, I am concerned with the second way due to limited GPU resources. First, I saved the network weights to .caffemodel files every 100 iterations using snapshot: 100. Based on these .caffemodel files, I want to calculate the validation loss with
../build/tools/caffe test -model ./test.prototxt -weights $snapshot -iterations 10 -gpu 0
where $snapshot is the file name of a .caffemodel, for example snap_network_100.caffemodel.
And the data layer of my test prototxt is
layer {
name: "data"
type: "HDF5Data"
top: "data"
top: "label"
include {
phase: TEST
}
hdf5_data_param {
source: "./list.txt"
batch_size: 8
shuffle: true
}
}
The first and second ways give different validation losses. I found that with the first way the validation loss is independent of the batch size, i.e. the validation loss is the same for different batch sizes. With the second way, the validation loss changes with the batch size, but the losses are very close to each other across different iteration counts.
My question is that which way is correct to compute validation loss?
You compute the validation loss for a different number of iterations:
test_iter: 200
in your 'solver.prototxt', vs. -iterations 10 when running from the command line. This means you are averaging the loss over a different number of validation samples.
Since you are using far fewer samples when validating from the command line, you are much more sensitive to batch_size.
Make sure you are using exactly the same settings and verify that the validation loss is indeed the same.
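If you want to cross-check the offline numbers, you can also average the loss yourself with pycaffe over the same number of batches as test_iter. This is only a sketch: the paths are the ones from the question, and it assumes the network's top loss blob is literally named 'loss'.
import caffe

caffe.set_mode_gpu()

# Paths taken from the question; adjust to your own files.
net = caffe.Net('./test.prototxt', 'snap_network_100.caffemodel', caffe.TEST)

test_iters = 200  # match test_iter from the solver, not -iterations 10
total_loss = 0.0
for _ in range(test_iters):
    out = net.forward()               # one batch of batch_size samples
    total_loss += float(out['loss'])  # assumes the output loss blob is named 'loss'

print('validation loss:', total_loss / test_iters)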
I'm trying to build a CNN to classify dogs. In fact, my data set consists of 5 classes of dogs. I have 50 images of dogs, split into 40 images for training and 10 for testing.
I've trained my network, based on the AlexNet pretrained model, for 100,000 and 140,000 iterations, but the accuracy is always between 20% and 30%.
In fact, I have adapted AlexNet to my problem as follows: I changed the name of the last fully connected layer and set its num_output to 5. I've also changed the name of the first fully connected layer (fc6).
So why does this model fail even though I've used data augmentation (cropping)?
Should I use a linear classifier on the top layer of my network, since I have only a little data and it is similar to the AlexNet dataset (as mentioned here: transfer learning), or is my data set very different from AlexNet's original data set, so that I should train a linear classifier on an earlier layer of the network?
Here is my solver :
net: "models/mymodel/train_val.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 200000
momentum: 0.9
weight_decay: 0.0005
snapshot: 1000
snapshot_prefix: "models/mymodel/my_model_alex_net_train"
solver_mode: GPU
Although you haven't given us much debugging information, I suspect that you've done some serious over-fitting. In general, a model's "sweet spot" for training depends on epochs, not iterations. Single-node AlexNet and GoogleNet, on an ILSVRC-style database, train in 50-90 epochs. Even if your batch size is as small as 1, you've trained for 2500 epochs with only 5 classes. With only 8 images per class, the AlexNet topology is serious overkill and is likely adapted to each individual photo.
Consider this: you have only 40 training photos, but 96 kernels in the first convolution layer and 256 in the second. This means that your model can spend over 2 kernels in conv1 and 6 in conv2 for each photograph! You get no commonality of features, no averaging... instead of edge detection generalizing to finding faces, you're going to have dedicated filters tuned to the individual photos.
In short, your model is trained to find "Aunt Polly's dog on a green throw rug in front of the kitchen cabinet with a patch of sun to the left." It doesn't have to learn to discriminate a basenji from a basset, just to recognize whatever is randomly convenient in each photo.
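For what it's worth, here is the rough arithmetic behind those numbers, assuming a batch size of 1 (a larger batch only makes the epoch count higher):
# Epochs implied by the iteration count in the question.
num_train_images = 40
batch_size = 1
max_iter = 100000

iters_per_epoch = num_train_images // batch_size  # 40 iterations per pass over the data
epochs = max_iter / iters_per_epoch               # 2500.0 epochs, vs. the usual 50-90
print('epochs trained:', epochs)

# Kernels available per training photo.
print('conv1 kernels per photo:', 96 / num_train_images)   # 2.4
print('conv2 kernels per photo:', 256 / num_train_images)  # 6.4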
I am training AlexNet on my own data using Caffe. One of the issues I see is that the "Train net output" loss and the "iteration loss" are nearly the same during training. Moreover, this loss fluctuates.
like:
...
...Iteration 900, loss 0.649719
... Train net output #0: loss = 0.649719 (* 1 = 0.649719 loss )
... Iteration 900, lr = 0.001
...Iteration 1000, loss 0.892498
... Train net output #0: loss = 0.892498 (* 1 = 0.892498 loss )
... Iteration 1000, lr = 0.001
...Iteration 1100, loss 0.550938
... Train net output #0: loss = 0.550944 (* 1 = 0.550944 loss )
... Iteration 1100, lr = 0.001
...
Should I expect this fluctuation?
As you can see, the difference between the reported losses is not significant. Does this indicate a problem with my training?
my solver is:
net: "/train_val.prototxt"
test_iter: 1999
test_interval: 10441
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 100
max_iter: 208820
momentum: 0.9
weight_decay: 0.0005
snapshot: 10441
snapshot_prefix: "/caffe_alexnet_train"
solver_mode: GPU
Caffe uses the Stochastic Gradient Descent (SGD) method for training the net. In the long run the loss decreases; locally, however, it is perfectly normal for the loss to fluctuate a bit.
The reported "iteration loss" is the weighted sum of all loss layers of your net, averaged over average_loss iterations. On the other hand, the reported "train net output..." reports each net output from the current iteration only.
In your example, you did not set average_loss in your 'solver', and thus average_loss=1 by default. Since you only have one loss output with loss_weight=1, the reported "train net output..." and "iteration loss" are the same (up to display precision).
To conclude: your output is perfectly normal.
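To illustrate the smoothing (this is just the idea, not Caffe's actual code): the reported value is a running average over the last average_loss iteration losses, so with the default average_loss of 1 it reduces to the current iteration's loss.
from collections import deque

def reported_loss(iteration_losses, average_loss=1):
    # Running average over the last `average_loss` iteration losses,
    # mimicking how the "Iteration N, loss" line is smoothed.
    window = deque(maxlen=average_loss)
    smoothed = []
    for loss in iteration_losses:
        window.append(loss)
        smoothed.append(sum(window) / len(window))
    return smoothed

losses = [0.649719, 0.892498, 0.550944]
print(reported_loss(losses, average_loss=1))  # identical to the raw losses
print(reported_loss(losses, average_loss=3))  # visibly smoother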
I train my data set using Caffe. I set (in solver.prototxt):
test_iter: 1000
test_interval: 1000
max_iter: 450000
base_lr: 0.0001
lr_policy: "step"
stepsize: 100000
The test accuracy is around 0.02 and the test loss is around 1.6 at the first test. Then the test accuracy increases and the test loss decreases at every test.
At iter 32000 the test accuracy is 1 and the test loss is 0.45.
Then the accuracy decreases and the loss increases.
I think the loss is too large for an accuracy of 1.
How do I know the result I got is good or not?
Is there any method I can use to make an evaluation?