AFAIK, we have two ways to obtain the validation loss.
(1) online during training process by setting the solver as follows:
train_net: 'train.prototxt'
test_net: "test.prototxt"
test_iter: 200
test_interval: 100
(2) offline, based on the weights in the .caffemodel file. In this question, I am interested in the second way because of limited GPU resources. First, I saved the network weights to a .caffemodel file every 100 iterations by setting snapshot: 100. Based on these .caffemodel files, I want to calculate the validation loss with
../build/tools/caffe test -model ./test.prototxt -weights $snapshot -iterations 10 -gpu 0
where $snapshot is the file name of a .caffemodel, for example snap_network_100.caffemodel.
And the data layer of my test prototxt is
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  hdf5_data_param {
    source: "./list.txt"
    batch_size: 8
    shuffle: true
  }
}
The first and the second ways give different validation losses. With the first way, the validation loss is independent of the batch size: it is the same for different batch sizes. With the second way, the validation loss changes with the batch size, but the losses are very close to each other across different iterations.
My question is: which way is correct for computing the validation loss?
You compute the validation loss over a different number of iterations:
test_iter: 200
in your 'solver.prototxt' vs. -iterations 10 when running from the command line. This means you are averaging the loss over a different number of validation samples.
Since you are using far fewer samples when validating from the command line, you are much more sensitive to batch_size.
Make sure you are using exactly the same settings, and verify that the validation loss is then indeed the same.
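To make the mismatch concrete, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and it assumes the online test net also uses batch_size: 8, which the question does not confirm.

```python
def samples_evaluated(iterations, batch_size):
    """Number of validation samples one test run averages over."""
    return iterations * batch_size


def iterations_for(num_samples, batch_size):
    """Smallest iteration count whose batches cover num_samples samples."""
    return -(-num_samples // batch_size)  # ceiling division


# Assumption: both runs read batches of 8 samples, as in the test prototxt.
online = samples_evaluated(200, 8)   # test_iter: 200 in the solver
offline = samples_evaluated(10, 8)   # -iterations 10 on the command line
print(online, offline)

# Iterations the command line run would need to average over the same
# number of samples as the online run.
print(iterations_for(online, 8))
```

With these numbers the offline run averages over far fewer samples than the online run, so its estimate is noisier.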
I am using caffe with the HDF5 layer. It will read my hdf5list.txt as
/home/data/file1.h5
/home/data/file2.h5
/home/data/file3.h5
Each file*.h5 contains 10,000 images, so I have about 30,000 images in total. In each iteration, I use a batch size of 10, with the following setting:
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param {
    source: "./hdf5list.txt"
    batch_size: 10
    shuffle: true
  }
  include {
    phase: TRAIN
  }
}
With Caffe, the output looks like
Iterations 10, loss=100
Iterations 20, loss=90
...
My question is: how do I compute the number of epochs corresponding to each loss value? I want to plot a graph with the number of epochs on the x-axis and the loss on the y-axis.
Related link: Epoch vs iteration when training neural networks
If you want to do this for just the current problem, it is super easy. Note that
Epoch_index = floor((iteration_index * batch_size) / (# data_samples))
Now, in solver.cpp, find the line where Caffe prints Iterations ..., loss = .... Just compute epoch index using the above formula and print that too. You are done. Do not forget to recompile Caffe.
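If recompiling Caffe is not convenient, the same formula can be applied to an existing log after the fact. The following is only a sketch: the function names are hypothetical, and the regular expression assumes log lines shaped like the ones quoted in the question.

```python
import re

# Assumed line shape: "Iterations 20, loss=90" or "Iteration 20, loss = 90".
LOSS_LINE = re.compile(r"Iterations?\s+(\d+),\s*loss\s*=\s*([^\s,]+)")


def epoch_index(iteration, batch_size, num_samples):
    """Epoch_index = floor((iteration_index * batch_size) / (# data_samples))."""
    return (iteration * batch_size) // num_samples


def loss_by_epoch(log_lines, batch_size, num_samples):
    """Return (epoch_index, loss) pairs for every loss line in the log."""
    points = []
    for line in log_lines:
        match = LOSS_LINE.search(line)
        if match:
            iteration = int(match.group(1))
            loss = float(match.group(2))
            points.append((epoch_index(iteration, batch_size, num_samples), loss))
    return points


# The two log lines from the question, with batch size 10 and ~30,000 images.
log = ["Iterations 10, loss=100", "Iterations 20, loss=90"]
print(loss_by_epoch(log, batch_size=10, num_samples=30000))
```

The resulting pairs can be passed to any plotting tool, with the epoch index on the x-axis and the loss on the y-axis.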
If you want to modify Caffe so that it always shows the epoch index, you will first need to compute the data size from all your HDF5 files. From a glance at the Caffe HDF5 layer code, I think you can get the number of data samples in a file from hdf_blobs_[0]->shape(0). You should add this up over all HDF5 files and use that number in solver.cpp.
The variable hdf_blobs_ is defined in layers/hdf5_data_layer.cpp. I believe it is populated by the functions in util/hdf5.cpp. I think this is how the flow goes:
In layers/hdf5_data_layer.cpp, the hdf5 filenames are read from the text file.
Then the function LoadHDF5FileData attempts to load the hdf5 data into blobs.
Inside LoadHDF5FileData, the blob variable hdf_blobs_ is declared, and it is populated by the functions in util/hdf5.cpp.
Inside util/hdf5.cpp, the function hdf5_load_nd_dataset first calls hdf5_load_nd_dataset_helper, which reshapes the blobs accordingly. I think this is where you will get the dimensions of your data for one hdf5 file. Iterating over multiple hdf5 files is done in the void HDF5DataLayer<Dtype>::Next() function in layers/hdf5_data_layer.cpp. So here you need to add up the data dimensions received earlier.
Finally, you need to figure out how to pass them back to solver.cpp.
I'm using a CNN for short text classification (classifying product titles).
The code is from
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
The accuracy on the training set, test set, and validation set is below:
and the loss is different. The loss on the validation set is double the loss on the training set and test set. (I can't upload more than 2 pictures. Sorry!)
The training set and test set were collected from the web by a crawler and then split 7:3. The validation set comes from real app messages and was labeled manually.
I have tried almost every hyperparameter.
I have tried up-sampling, down-sampling, and no sampling.
batch size of 1024, 2048, 5096
dropout with 0.3, 0.5, 0.7
embedding_size with 30, 50, 75
But none of these work!
Now I use the parameters below:
batch size is 2048.
embedding_size is 30.
sentence_length is 15
filter_size is 3,4,5
dropout_prob is 0.5
l2_lambda is 0.005
At first I thought it was overfitting. But the model performs better on the test set than on the training set, so I am confused!
Is it that the distributions of the test set and the training set are very different?
How can I improve the performance on the validation set?
I think this difference in loss comes from the fact that the validation dataset was collected from a different domain than the training/test sets:
"The training set and test set are from web by crawler, then split them with 7:3. And the validation is from real app message and tagged by manual marking."
The model did not see any real app message data during training, so it unsurprisingly fails to deliver good results on the validation set. Traditionally, all three sets are generated from the same pool of data (say, with a 7-1-2 split). The validation set is used for hyperparameter tuning (batch_size, embedding_size, etc.), while the test set is held out for an objective measure of model performance.
If you are ultimately concerned with performance on the app data, I would split that dataset up 7-1-2 (train-validation-test) and augment the training data with web crawler data.
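The suggested split can be sketched as follows. This is a minimal illustration, not the answerer's code: the function names and the fixed seed are assumptions, and the items can be any records, for example (text, label) pairs.

```python
import random


def split_7_1_2(items, seed=0):
    """Shuffle items and split them 7-1-2 into train, validation, test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def build_sets(app_items, crawler_items, seed=0):
    """Split the app data 7-1-2 and add the crawler data to training only."""
    train, val, test = split_7_1_2(app_items, seed)
    return train + list(crawler_items), val, test


# Placeholder records: 100 app messages and 50 crawled titles.
app = ["app_%d" % i for i in range(100)]
crawled = ["web_%d" % i for i in range(50)]
train, val, test = build_sets(app, crawled)
print(len(train), len(val), len(test))
```

Keeping the crawler data out of the validation and test sets means both of them measure performance on the domain that matters in the end.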
I think the loss on the validation set is high because the validation data comes from real app messages, which may be more realistic than the training data you obtained from web crawling, which may contain noise. Your learning rate is very high, and your batch size is much bigger than what is usually recommended. You can try learning rates in [0.1, 0.01, 0.001, 0.0001] and batch sizes in [32, 64]; the other hyperparameter values seem to be okay.
I would also like to comment on the training, validation and test sets. The training data is split into training and validation sets for training, while the test set is the data we don't touch and use only to test our model at the end. I think your validation set is really the 'test set' and your test set is the 'validation set'. That's how I would refer to them.
Using the BVLC reference AlexNet file, I have been training a CNN against a training set I created. In order to measure the progress of training, I have been using a rough method to approximate the accuracy against the training data. My batch size on the test net is 256. I have ~4500 images. I perform 17 calls to solver.test_nets[0].forward() and record the value of solver.test_nets[0].blobs['accuracy'].data (the accuracy of that forward pass). I take the average across these. My thought was that I was taking 17 random samples of 256 from my validation set and getting the accuracy of these random samplings. I would expect this to closely approximate the true accuracy against the entire set. However, I later went back and wrote a script to go through each item in my LMDB so that I could generate a confusion matrix for my entire test set. I discovered that the true accuracy of my model was significantly lower than the estimated accuracy. For example, my expected accuracy of ~75% dropped to ~50% true accuracy. This is a far worse result than I was expecting.
My assumptions match the answer given here.
Have I made an incorrect assumption somewhere? What could account for the difference? I had assumed that the forward() function gathered a random sample, but I'm not so sure that was the case. blobs['accuracy'].data returned a different result (though usually within a small range) every time, which is why I assumed this.
I had assumed that the forward() function gathered a random sample, but I'm not so sure that was the case. blobs['accuracy'].data returned a different result (though usually within a small range) every time, which is why I assumed this.
The forward() function in Caffe does not perform any random sampling; it only fetches the next batch according to your data layer. E.g., in your case forward() will pass the next 256 images through your network. Performing this 17 times will sequentially pass 17x256=4352 images.
Have I made an incorrect assumption somewhere? What could account for the difference?
Check that the script that goes through your whole LMDB performs the same data pre-processing as during training.
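The batch arithmetic from the answer can be checked with a small sketch. The helper names are hypothetical, and the figure of 4500 images stands in for the "~4500" mentioned in the question.

```python
def images_seen(num_forward_calls, batch_size):
    """Images passed through the net by consecutive forward() calls."""
    return num_forward_calls * batch_size


def calls_to_cover(num_images, batch_size):
    """Smallest number of forward() calls that reaches every image once."""
    return -(-num_images // batch_size)  # ceiling division


def mean_accuracy(batch_accuracies):
    """Plain average of per-batch accuracies (all batches have equal size)."""
    return sum(batch_accuracies) / len(batch_accuracies)


print(images_seen(17, 256))       # images covered by 17 sequential batches
print(calls_to_cover(4500, 256))  # calls needed to reach all 4500 images
```

Under these numbers, 17 sequential batches already cover nearly the whole set, so the averaged batch accuracy should be close to the full-set accuracy when both paths preprocess the data identically.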
I ran Caffe and got this output:
Can anyone tell me what the problem is?
I would really appreciate it!
It seems like one (or more) of your label values is invalid; see this PR for information:
If you have an invalid ground truth label, "SoftmaxWithLoss" will silently access invalid memory [...] The old check only worked in DEBUG mode and also only worked for CPU.
Make sure your prediction vector length matches the number of labels you try to predict.
From your comments, it seems like you have labels in the range 0..10575, but on the other hand, your classification layer, "fc7", only predicts probabilities for 1000 classes. Thus, the "SoftmaxWithLoss" layer tries to compute the loss for predicting a label l >= 1000 and accesses memory outside the probability array, resulting in a segmentation fault.
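A quick check of the labels before training can catch this mismatch. The sketch below uses hypothetical helper names and plain Python lists; it assumes labels are integer class indices starting at 0.

```python
def find_invalid_labels(labels, num_outputs):
    """Return (position, label) pairs for labels outside 0..num_outputs-1."""
    return [(i, label) for i, label in enumerate(labels)
            if not 0 <= label < num_outputs]


def required_outputs(labels):
    """Smallest number of outputs that makes every label valid."""
    return max(labels) + 1


# Illustrative labels: the last three do not fit a 1000-way classifier.
labels = [0, 999, 1000, 10575, -1]
print(find_invalid_labels(labels, num_outputs=1000))
print(required_outputs([0, 999, 1000, 10575]))
```

If the list of invalid labels is not empty, either the labels need to be remapped or the last layer needs at least as many outputs as there are classes.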
I've noticed that a frequent occurrence during training is NANs being introduced.
Oftentimes it seems to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.
Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?
The overarching question here is simply: What is the most common reason for NANs to occur during training? And secondly, what are some methods for combatting this (and why do they work)?
I came across this phenomenon several times. Here are my observations:
Gradient blow up
Reason: large gradients throw the learning process off-track.
What you should expect: Looking at the runtime log, you should look at the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss becomes too large to be represented by a floating-point variable and becomes nan.
What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
Bad learning rate policy and params
Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead. This invalid rate multiplies all updates and thus invalidates all parameters.
What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:
... sgd_solver.cpp:106] Iteration 0, lr = -nan
What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
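Scanning a saved log for the first non-finite learning rate or loss can help tell this case apart from the others. The sketch below is an illustration only: the function name is hypothetical, and the pattern assumes lines shaped like the "Iteration 0, lr = -nan" example above.

```python
import math
import re

# Assumed line shape: "... Iteration <n>, lr = <value>" or "... loss = <value>".
VALUE_LINE = re.compile(r"Iteration\s+(\d+),\s*(lr|loss)\s*=\s*([^\s,]+)")


def first_non_finite(log_lines):
    """Return (iteration, name) for the first nan/inf lr or loss, else None."""
    for line in log_lines:
        match = VALUE_LINE.search(line)
        if not match:
            continue
        try:
            value = float(match.group(3))  # float() also parses "-nan" and "inf"
        except ValueError:
            continue  # not a number, skip the line
        if not math.isfinite(value):
            return int(match.group(1)), match.group(2)
    return None


print(first_non_finite(["... sgd_solver.cpp:106] Iteration 0, lr = -nan"]))
```

A hit on lr at the very first iteration points at the solver settings rather than at the data or the loss.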
Faulty Loss function
Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding the InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.
What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: See if you can reproduce the error, add printout to the loss layer and debug the error.
For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all - the loss computed produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
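The failure mode in that example can be sketched in a few lines. The answer does not give the exact normalization, so the inverse-frequency weighting below is a hypothetical stand-in, with a guard for labels that are absent from the batch.

```python
from collections import Counter


def label_frequency_weights(batch_labels, num_classes):
    """Inverse-frequency weight per class for one batch.

    A class missing from the batch has frequency 0; dividing by it would
    give inf (and 0 * inf later gives nan), so such classes get weight 0.
    """
    counts = Counter(batch_labels)
    batch_size = len(batch_labels)
    weights = []
    for cls in range(num_classes):
        freq = counts[cls] / batch_size
        weights.append(1.0 / freq if freq > 0 else 0.0)
    return weights


# Class 2 never appears in this small batch.
print(label_frequency_weights([0, 0, 1, 1], num_classes=3))
```

Larger batches make an absent class less likely, which matches the workaround described above, while an explicit guard removes the dependence on batch size.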
Faulty input
Reason: you have an input with nan in it!
What you should expect: once the learning process "hits" this faulty input - output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
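As a lighter alternative to the dummy net, the inputs can also be scanned outside Caffe. This sketch assumes each sample has already been loaded as a flat sequence of floats; the function name is hypothetical and the loading step is left out.

```python
import math


def find_faulty_samples(samples):
    """Return the indices of samples that contain nan or inf values."""
    return [index for index, sample in enumerate(samples)
            if not all(math.isfinite(value) for value in sample)]


# Illustrative data: samples 1 and 2 are faulty.
samples = [
    [0.1, 0.2, 0.3],
    [float("nan"), 1.0, 2.0],
    [3.0, float("inf"), 4.0],
]
print(find_faulty_samples(samples))
```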
Stride larger than kernel size in "Pooling" layer
For some reason, choosing stride > kernel_size for pooling may result in nans. For example:
layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}
results in nans in y.
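One possible explanation, offered here as an assumption and not something stated in the answer: if the number of output positions is computed as ceil((size - kernel_size) / stride) + 1 with no padding, the last window can start at or beyond the end of the input and cover zero elements, so an average over it would be 0/0. The sketch below only enumerates the windows along one dimension under that assumed formula; the function names are hypothetical.

```python
import math


def pooling_windows(size, kernel_size, stride):
    """(start, end) of each pooling window along one dimension, no padding.

    Assumes the output length is ceil((size - kernel_size) / stride) + 1
    and that each window is clipped to the input size.
    """
    num_out = math.ceil((size - kernel_size) / stride) + 1
    windows = []
    for i in range(num_out):
        start = i * stride
        end = min(start + kernel_size, size)
        windows.append((start, end))
    return windows


def empty_windows(size, kernel_size, stride):
    """Windows that cover no input element at all."""
    return [(start, end)
            for start, end in pooling_windows(size, kernel_size, stride)
            if end <= start]


# An input of length 10 (an arbitrary example size) with the layer settings above.
print(pooling_windows(10, kernel_size=3, stride=5))
print(empty_windows(10, kernel_size=3, stride=5))
```

Whether an empty window appears depends on the input size, which would be consistent with the nans showing up only under some settings.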
Instabilities in "BatchNorm"
It was reported that under some settings "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.
Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' makes caffe print more debug information to the log (including gradient magnitudes and activation values) during training. This information can help in spotting gradient blowups and other problems in the training process.
In my case, not setting the bias in the convolution/deconvolution layers was the cause.
Solution: add the following to the convolution layer parameters.
bias_filler {
  type: "constant"
  value: 0
}
This answer is not about a cause for nans, but rather proposes a way to help debug it.
You can have this python layer:
import caffe
import numpy as np

class checkFiniteLayer(caffe.Layer):
    def setup(self, bottom, top):
        # param_str is used as a prefix in the error messages
        self.prefix = self.param_str

    def reshape(self, bottom, top):
        pass  # in-place layer: nothing to reshape

    def forward(self, bottom, top):
        # check the activations entering this layer
        for i in range(len(bottom)):
            isbad = np.sum(1 - np.isfinite(bottom[i].data[...]))
            if isbad > 0:
                raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
                                (self.prefix, i, 100 * float(isbad) / bottom[i].count))

    def backward(self, top, propagate_down, bottom):
        # check the gradients flowing back through this layer
        for i in range(len(top)):
            if not propagate_down[i]:
                continue
            isf = np.sum(1 - np.isfinite(top[i].diff[...]))
            if isf > 0:
                raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
                                (self.prefix, i, 100 * float(isf) / top[i].count))
Add this layer to your train_val.prototxt at the points you suspect may cause trouble:
layer {
  type: "Python"
  name: "check_loss"
  bottom: "fc2"
  top: "fc2" # "in-place" layer
  python_param {
    module: "/path/to/python/file/check_finite_layer.py" # must be in $PYTHONPATH
    layer: "checkFiniteLayer"
    param_str: "prefix-check_loss" # string for printouts
  }
}
The learning rate is too high and should be decreased.
In my RNN code the accuracy was nan; choosing a lower value for the learning rate fixed it.
One more solution for anyone stuck like I just was:
I was receiving nan or inf losses on a network I set up with float16 dtype across the layers and input data. After all else failed, it occurred to me to switch back to float32, and the nan losses were solved!
So bottom line, if you switched dtype to float16, change it back to float32.
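The answer does not say why float16 produced nans. A common suspicion, stated here as an assumption, is the narrow range of half precision: the largest finite value is 65504, and very small values round to zero. The sketch below illustrates both limits using only the standard library's struct module, whose "e" format is IEEE half precision; the helper names are hypothetical.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE half-precision value


def fits_in_float16(value):
    """True if value can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def to_float16(value):
    """Round value to half precision and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(fits_in_float16(FLOAT16_MAX))  # still representable
print(fits_in_float16(70000.0))      # too large for float16
print(to_float16(1e-8))              # underflows to zero
```

Activations or losses that exceed this range become inf, and later operations such as inf - inf or 0 * inf turn into nan, which float32 avoids thanks to its much wider range.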
I was trying to build a sparse autoencoder and had several layers in it to induce sparsity. While running my net, I encountered the NaN's. On removing some of the layers (in my case, I actually had to remove 1), I found that the NaN's disappeared. So, I guess too much sparsity may lead to NaN's as well (some 0/0 computations may have been invoked!?)