batch_loss and total_loss=tf.get_total_loss() in tensorflow

I ran into a problem while reading the im2txt source code.
There are batch_loss and total_loss: batch_loss is computed for each batch of data and is added to tf.GraphKeys.LOSSES by the tf.losses.add_loss(batch_loss) call. The total_loss is obtained from tf.losses.get_total_loss(), which combines all the losses in tf.GraphKeys.LOSSES.
Question: why are the parameters updated with total_loss rather than batch_loss? This has confused me for many days.

The summary of discussion in the comments:
The training loss is computed in the forward pass over the mini-batch, but the actual loss value isn't needed to begin the backprop. Backprop starts from the error signal, which is the derivative of the loss function evaluated at the values from the forward pass. So the loss value doesn't affect the parameter updates and is reported simply to monitor the training process: for example, if the loss does not decrease, that's a sign to double-check the neural network model and hyperparameters. It's therefore not a big deal to smooth the reported loss through averaging just to make the chart look nicer.
See this post for more details.
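For illustration, here is a minimal TF 1.x-style sketch (not the actual im2txt code) of how a per-batch loss is registered in tf.GraphKeys.LOSSES and how the combined total_loss is what the optimizer differentiates; the reported scalar value is only used for monitoring:

```python
# Hypothetical TF 1.x sketch, not the im2txt implementation itself.
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 10])
targets = tf.placeholder(tf.float32, [None, 1])

logits = tf.layers.dense(inputs, 1)
batch_loss = tf.reduce_mean(tf.square(logits - targets), name="batch_loss")

# Register the per-batch loss in the standard collection...
tf.losses.add_loss(batch_loss)
# ...and combine everything in tf.GraphKeys.LOSSES (plus regularization losses).
total_loss = tf.losses.get_total_loss()

# The optimizer only needs the gradients of total_loss; its numeric value is
# just a scalar you can log (and smooth) to watch training progress.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)
```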

Related

Validation and training loss per batch and epoch

I am using Pytorch to run some deep learning models. I am currently keeping track of training and validation loss per epoch, which is pretty standard. However, what is the best way of going about keeping track of training and validation loss per batch/iteration?
For training loss, I could just keep a list of the loss after each training iteration. But validation loss is calculated after a whole epoch, so I'm not sure how to go about getting the validation loss per batch. The only thing I can think of is to run the whole validation step after each training batch and keep track of those results, but that seems like overkill and a lot of computation.
For example, the training is like this:
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
And for validation loss:
correct = 0
total = 0
loss_test = 0.0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        # validation loss (error is the validation criterion)
        batch_loss = error(outputs.float(), labels.long()).item()
        loss_test += batch_loss
loss_test /= len(testloader)
The validation loss/test part is done per epoch. I’m looking for a way to get the validation loss per batch, which is my point above.
Any tips?
Well, you're right, that is the way to do it: "run the whole validation step after each training batch and keep track of those". And as you suspected, it is time-consuming and would be overkill. However, if it's something you really need, there is a way to do it. Say you have 1000 batches in your data. To calculate a per-batch val_loss, you can choose not to run the validation step for every single batch (you would have to do it 1000 times!) but only for a small subset of those batches, say 50 or 100 (choose whatever you find feasible). With a bit of statistics, the estimate computed on those 50 or 100 batches will be very close to the one computed on all 1000 batches, provided you introduce some randomness into the batch selection.
This means you randomly select 100 batches from your 1000 batches for which you'll run the validation step.
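As a rough illustration of that idea, here is a hedged PyTorch sketch that estimates the validation loss from a random subset of validation batches; the names net, criterion and val_loader are assumptions, not from the original post:

```python
import random
import torch

def estimate_val_loss(net, criterion, val_loader, num_batches=100, device="cpu"):
    """Approximate the validation loss from `num_batches` randomly chosen batches."""
    net.eval()
    # Materialize the loader once so we can sample batches from it; for very
    # large validation sets you would sample indices with a Subset/sampler instead.
    batches = list(val_loader)
    sample = random.sample(batches, min(num_batches, len(batches)))
    total_loss = 0.0
    with torch.no_grad():
        for inputs, labels in sample:
            inputs, labels = inputs.to(device), labels.to(device)
            total_loss += criterion(net(inputs), labels).item()
    net.train()
    return total_loss / len(sample)
```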
An epoch is one pass of the model over the entire training set, which is generally divided into batches and is usually shuffled. The validation set, on the other hand, is used to tune the hyper-parameters of your training and to find out how your model behaves on new data. In that respect, to me, evaluating at epoch = 1/2 doesn't make much sense, because the question is: whatever the performance on the evaluation set at epoch = 1/2, what can you do about it? Since you don't know which data the model has seen in the first half of the epoch, there's no way to take advantage of 'the first half being better'. And remember that your data will likely be shuffled into batches.
Therefore, I would stick with the classic approach: train on the entire set then, and only then, evaluate on another set. In some cases, you won't even allow yourself to evaluate once per epoch, because of the computation time. Instead you would evaluate every n epochs. But then again it will depend on your dataset size, your sampling from that dataset, the batch size, and the computation cost.
For the training loss, you can keep track of its value per-update-step vs. per-epoch. This will give you much more control over whether or not your model is learning independently from the validation phase.
Edit - As an alternative to running the entire evaluation set after each training batch, you could do the following: shuffle your validation set and use the same batch size as for your trainset.
len(trainset)//batch_size is the number of updates per epoch.
len(validset)//batch_size is the number of partial evaluations available per epoch.
Every len(trainset)//len(validset) training updates, you can evaluate on 1 batch.
This spreads the validation set evenly over the epoch, giving you len(validset)//batch_size pieces of feedback per epoch.
If you set your train/valid ratio to 0.1, so that len(validset) = 0.1*len(trainset), that's one partial evaluation every ten training updates.
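A minimal sketch of that interleaving, assuming names like net, criterion, optimizer, trainloader and validloader (none of which come from the original post): every len(trainloader)//len(validloader) training updates, evaluate on a single validation batch.

```python
import torch

eval_every = max(1, len(trainloader) // len(validloader))
valid_iter = iter(validloader)

for step, (inputs, labels) in enumerate(trainloader):
    optimizer.zero_grad()
    loss = criterion(net(inputs), labels)
    loss.backward()
    optimizer.step()

    if step % eval_every == 0:
        try:
            val_inputs, val_labels = next(valid_iter)
        except StopIteration:              # wrap around the validation loader
            valid_iter = iter(validloader)
            val_inputs, val_labels = next(valid_iter)
        net.eval()
        with torch.no_grad():
            val_loss = criterion(net(val_inputs), val_labels).item()
        net.train()
        print(f"step {step}: train loss {loss.item():.4f}, val loss {val_loss:.4f}")
```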

Do I need to add ReLU function before last layer to predict a positive value?

I am developing a model using linear regression to predict age. I know the age ranges from 0 to 100, and any value in that range is possible. I use a 1x1 convolution in the last layer to predict the real value. Do I need to add a ReLU after the output of the 1x1 convolution to guarantee that the predicted value is positive? Currently I did not add a ReLU, and some predicted values come out negative, like -0.02 or -0.4.
There's no compelling reason to use an activation function on the output layer; typically you just want to use a reasonable/suitable loss function directly with the penultimate layer's output. Specifically, a ReLU doesn't solve your problem (or at most solves only 'half' of it), since the network can still predict values above 100. In this case (predicting a continuous outcome) there are a few standard loss functions, like squared error or the L1 norm.
If you really want an activation function on this final layer and are concerned about always predicting within a bounded interval, you could try scaling up the sigmoid function (to between 0 and 100). However, there's nothing special about the sigmoid here - any bounded function, e.g. the CDF of any continuous random variable, could be used in the same way. Though for optimization, something easily differentiable is important.
Why not start with something simple like a squared-error loss? It's always possible to just 'clamp' out-of-range predictions into [0, 100] (we can give this a fancy name like 'doubly ReLU') when you actually need to make predictions (as opposed to during training/testing), but if you're getting lots of such errors, the model may have more fundamental problems.
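For instance, a small PyTorch sketch of the 'clamp at prediction time' idea (the function name is illustrative): train with a plain squared-error loss and only clip out-of-range ages when producing final predictions.

```python
import torch

def predict_age(raw_outputs, low=0.0, high=100.0):
    # raw_outputs: unbounded regression outputs from the network
    return torch.clamp(raw_outputs, min=low, max=high)

print(predict_age(torch.tensor([-0.4, 37.2, 112.0])))  # -> [0.0, 37.2, 100.0]
```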
Even for a regression problem, it can be good (for optimisation) to use a sigmoid layer before the output (giving a prediction in the [0, 1] range), followed by a denormalization (here, if you think the maximum age is 100, just multiply by 100).
This tip is explained in this fast.ai course.
I personally think these lessons are excellent.
You should use a sigmoid activation function and then normalize the targets to the [0, 1] range. This solves both issues: the output is positive and has an upper limit.
You can easily then denormalize the neural network outputs to get an output in the [0, 100] range.
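A hedged PyTorch sketch of that suggestion (AgeHead and max_age are illustrative names, not from the original posts); here the sigmoid output is rescaled directly, which plays the same role as normalizing the targets to [0, 1] and denormalizing the predictions afterwards:

```python
import torch
import torch.nn as nn

class AgeHead(nn.Module):
    """1x1 convolution head whose output is bounded to (0, max_age)."""
    def __init__(self, in_channels, max_age=100.0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.max_age = max_age

    def forward(self, features):
        # Sigmoid bounds the raw output to (0, 1); multiplying denormalizes it.
        return torch.sigmoid(self.conv(features)) * self.max_age
```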

How does Caffe determine test set accuracy?

Using the BVLC reference AlexNet file, I have been training a CNN against a training set I created.  In order to measure the progress of training, I have been using a rough method to approximate the accuracy against the training data.  My batch size on the test net is 256.  I have ~4500 images.  I perform 17 calls to solver.test_nets[0].forward() and record the value of solver.test_nets[0].blobs['accuracy'].data (the accuracy of that forward pass).  I take the average across these.  My thought was that I was taking 17 random samples of 256 from my validation set and getting the accuracy of these random samplings.  I would expect this to closely approximate the true accuracy against the entire set.  However, I later went back and wrote a script to go through each item in my LMDB so that I could generate a confusion matrix for my entire test set.  I discovered that the true accuracy of my model was significantly lower than the estimated accuracy.  For example, my expected accuracy of ~75% dropped to ~50% true accuracy.  This is a far worse result than I was expecting.
My assumptions match the answer given here.
Have I made an incorrect assumption somewhere? What could account for the difference? I had assumed that the forward() function gathered a random sample, but I'm not so sure that was the case. blobs['accuracy'].data returned a different result (though usually within a small range) every time, which is why I assumed this.
I had assumed that the forward() function gathered a random sample, but I'm not so sure that was the case. blobs['accuracy'].data returned a different result (though usually within a small range) every time, which is why I assumed this.
The forward() function in Caffe does not perform any random sampling; it only fetches the next batch according to your DataLayer. E.g., in your case forward() will pass the next 256 images through your network. Performing this 17 times will sequentially pass 17x256 = 4352 images.
Have I made an incorrect assumption somewhere? What could account for the difference?
Check that the script that goes through your whole LMDB performs the same data pre-processing as during training.
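For reference, the estimation procedure described in the question corresponds roughly to this pycaffe-style sketch (the 17 forward calls and the batch size of 256 come from the question; solver is assumed to be an already-initialized pycaffe solver):

```python
import numpy as np

num_batches = 17                 # 17 * 256 = 4352 images, fetched sequentially
accuracies = []
for _ in range(num_batches):
    solver.test_nets[0].forward()                        # next batch from the DataLayer
    accuracies.append(float(solver.test_nets[0].blobs['accuracy'].data))

print("estimated accuracy:", np.mean(accuracies))
```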

Backpropagation neural network - error not converging

I am using the backpropagation algorithm for my model. It works perfectly fine for a simple XOR case and when I tested it on a smaller subset of my actual data.
There are 3 inputs in total and a single output (0, 1, or 2).
I have split the data set into a training set (80%, amounting to approx. 5.5k samples) and the remaining 20% as validation data.
I use trainingRate and momentum for calculating the delta weights.
I have normalized the input as below
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(input_array)
I use 1 hidden layer with sigmoid and linear activation functions for input-hidden and hidden-output respectively.
I train with trainingRate = 0.0005, momentum = 0.6, Epochs = 100,000. Any higher trainingRate shoots the error up to NaN. Momentum values between 0.5 and 0.9 work fine; any other value makes the error NaN.
I tried various numbers of nodes in the hidden layer, such as 3, 6, 9, and 10, and the error converged to 4140.327574 in each case. I am not sure how to reduce this. Changing the activation functions doesn't help. I even tried adding another hidden layer with a Gaussian activation function, but I cannot reduce the error whatsoever.
Is it because of the outliers? Do I need to clean those values from the training data?
Any suggestion would be of great help, be it about the activation function, hidden layers, etc. I have been trying to get this working for quite some time and I am sort of stuck now.
Well, I'm having a similar problem and still haven't fixed it, but I can tell you a couple of things I have found. I think the net is overfitting: my error at some point goes down and then starts going up again, and so does the error on the verification set. Is this your case as well?
Check whether you are implementing the "early stopping" algorithm correctly; most of the time the problem is not the backpropagation but the error analysis or the validation analysis.
Hope this helps!
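To make the early-stopping suggestion concrete, here is a hedged, framework-agnostic Python sketch of the check (the helper and its thresholds are illustrative, not from the original post): stop once the validation error has not improved for a number of consecutive epochs.

```python
def should_stop(val_errors, patience=20, min_delta=1e-4):
    """Return True once the last `patience` validation errors failed to improve
    on the best error seen before them by at least `min_delta`."""
    if len(val_errors) <= patience:
        return False
    best_before = min(val_errors[:-patience])
    recent_best = min(val_errors[-patience:])
    return recent_best > best_before - min_delta

# Example: the error dips and then rises again (the overfitting pattern
# described above), so training should stop.
errors = [5.0, 3.0, 2.0, 1.8, 1.9, 2.1, 2.3, 2.4, 2.5, 2.6]
print(should_stop(errors, patience=5))  # True
```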

Does it make any sense that weights and threshold are growing proportionally when training my perceptron?

I am taking my first steps with neural networks, and to do so I am experimenting with a very simple single-layer, single-output perceptron that uses a sigmoidal activation function. I update my weights online each time a training example is presented, using:
weights += learningRate * (correct - result) * {input,1}
Here weights is an n-length vector that also contains the weight from the bias neuron (minus the threshold), result is the output computed by the perceptron (and passed through the sigmoid) for the given input, correct is the correct result, and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logical AND, the weights don't converge for a long time; instead they keep growing in a similar way and maintain a ratio of circa -1.5 with the threshold. For instance, the three weights are, in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like it is connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't find an explanation for this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even for very large positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it also helps to look at correct and result.
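To see the effect of the step activation the first answer recommends, here is a minimal NumPy sketch of the online update rule from the question (the function and variable names are illustrative):

```python
import numpy as np

def step(x):
    return 1.0 if x > 0 else 0.0

def train_perceptron(samples, targets, learning_rate=0.1, epochs=100):
    weights = np.zeros(3)                    # two input weights + bias weight
    for _ in range(epochs):
        for x, correct in zip(samples, targets):
            augmented = np.append(x, 1.0)    # the fixed bias input
            result = step(np.dot(weights, augmented))
            weights += learning_rate * (correct - result) * augmented
    return weights

# Logical AND: the weights stop changing once every example is classified
# correctly, because (correct - result) then becomes exactly zero.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
print(train_perceptron(X, y))
```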
