What bookkeeping is caffe doing? - machine-learning

Caffe tutorial states:
The net is a set of layers connected in a computation graph – a directed acyclic graph (DAG) to be exact. Caffe does all the bookkeeping for any DAG of layers to ensure correctness of the forward and backward passes.
What is the meaning by "all the bookkeeping"? I don't understand it.
How to do all the bookkeeping?

Caffe, like many other deep-learning frameworks, trains its models using stochastic gradient decent (SGD), implemented as gradient back propagation. That is, for a mini-batch of training examples, caffe feed the batch through the net ("forward pass") to compute the loss w.r.t the net's parameters. Then, it propagates the loss gradient back ("backward pass") to update all the parameters according to the estimated gradient.
By "bookkeeping" the tutorial means, you do not need to worry about estimating the gradients and updating the parameters. Once you are using existing layers (e.g., "Convolution", "ReLU", "Sigmoid" etc.) you only need to define the graph structure (the net's architecture) and supply the training data, and caffe will take care of the rest of the training process: It will forward/backward each mini batch, compute the loss, estimate the gradients and update the parameters - all for you.
Pretty awesome, don't you think? ;)

Related

How does Caffe updates gradients of a blob with more than one output branch?

Caffe supports multiple losses. Then for the backpropagation stage, some blobs may have multiple gradients coming from different losses. How does Caffe do with the gradients of this blob?
As far as I know, this may not be a concern when designing networks. But this question really confuse me when I try to write a new layer. Thanks for any idea!
This is not an issue of caffe or any other deep-learning tool. This is purely a mathematical question: When you have several losses, you have loss_weight assigned to each loss and the overall loss of the net is the weighted sum of all losses. Consequently, the gradients computed for the net are gradients of the weighted sum of the losses: there is no gradient-per-loss that needs to be integrated, but rather a single loss which is a weighted sum of loss layers.
Caffe usually uses "Split" layer when directing the "top" of a layer into several layers (in your example the output of "conv2" is "Split" to a "bottom" of "auxiliary loss" and "ip1").
Looking at the code of backward propagation of "Split" layer you can see that the all top.diffs are summed into bottom.diff.

Why faster-rcnn end to end training only makes approximation?

In faster rcnn (https://arxiv.org/abs/1506.01497),
there are two ways to train the network.
one way is jointly training rpn and fast rcnn.
the other way is to train both rpn and fast rcnn in the end-to-end manner.
However, the author said that in the end-to-end training, the result is only approximation to jointly training.
the reason for only approximation is
this solution ignores the derivative w.r.t. the proposal boxes’ coordinates that are also network responses, so is approximate.
However, from the network definition (https://github.com/rbgirshick/py-faster-rcnn/blob/master/models/pascal_voc/VGG16/faster_rcnn_end2end/train.prototxt), the bounding box regression for rpn is updated for each training iteration, so it's not ignored.
so, why it ignores the derivative of proposal boxes coordinates? what does that mean?
The slide Training R-CNNs of various velocities talks about this in detail in page 40-45. In short, this is because the derivative of the loss function to the ROI layer is undefined, so a surrogate gradient is used, in that case, this derivative is undefined.
P.S.
Link to ICCV 2015 Tutorial
The Github README page guide me to the slide

Noisy validation loss in Keras when using fit_generator

Any idea about why our training loss is smooth and our validation loss is that noisy (see the link) across epochs? We are implementing a deep learning model for diabetic retinopathy detection (binary classification) using the data set of fundus photographs provided by this Kaggle competition. We are using Keras 2.0 with Tensorflow backend.
As the data set is too big to fit in memory, we are using fit_generator, with ImageDataGenerator randomly taking images from training and validation folders:
# TRAIN THE MODEL
model.fit_generator(
train_generator,
steps_per_epoch= train_generator.samples // training_batch_size,
epochs=int(config['training']['epochs']),
validation_data=validation_generator,
validation_steps= validation_generator.samples // validation_batch_size,
class_weight=None)
Our CNN architecture is VGG16 with dropout = 0.5 in the last two fully connected layers, batch normalization only before the first fully connected layer, and data augmentation (consisting on flipping the images horizontally and vertically). Our training and validation samples are normalized using the training set mean and standard deviation. Batch size is 32. Our activation is a sigmoid and the loss function is the binary_crossentropy. You can find our implementation in Github
It definitely has nothing to do with overfitting, as we tried with a highly regularized model and the behavior was quite the same. Is it related with the sampling from the validation set? Has any of you had a similar problem before?
Thanks!!
I would look, in that order:
bug in validation_generator implementation (incl. steps - does it go through all pics reserved for validation?)
in validation_generator, do not use augmentation (reason: an augmentation might be bad, not learnable, and at train, it does achieve a good score only by hard-coding relationships which are not generalizable)
change train/val split to 50/50
calculate, via a custom callback, the validation loss at the end of the epoch (use the same function, but calling it with a callback produces different (more accurate, at certain, non-standard models) results)
If nothing of the above gives a more smooth validation loss curve, then my next assumption would be that this is the way it is, and I might need to work on the model architecture

Training of a CNN using Backpropagation

I have earlier worked in shallow(one or two layered) neural networks, so i have understanding of them, that how they work, and it is quite easy to visualize the derivations for forward and backward pass during the training of them, Currently I am studying about Deep neural networks(More precisely CNN), I have read lots of articles about their training, but still I am unable to understand the big picture of the training of the CNN, because in some cases people using pre- trained layers where convolution weights are extracted using auto-encoders, in some cases random weights were used for convolution, and then using back propagation they train the weights, Can any one help me to give full picture of the training process from input to fully connected layer(Forward Pass) and from fully connected layer to input layer (Backward pass).
Thank You
I'd like to recommend you a very good explanation of how to train a multilayer neural network using backpropagation. This tutorial is the 5th post of a very detailed explanation of how backpropagation works, and it also has Python examples of different types of neural nets to fully understand what's going on.
As a summary of Peter Roelants tutorial, I'll try to explain a little bit what is backpropagation.
As you have already said, there are two ways to initialize a deep NN: with random weights or pre-trained weights. In the case of random weights and for a supervised learning scenario, backpropagation works as following:
Initialize your network parameters randomly.
Feed forward a batch of labeled examples.
Compute the error (given by your loss function) within the desired output and the actual one.
Compute the partial derivative of the output error w.r.t each parameter.
These derivatives are the gradients of the error w.r.t to the network's parameters. In other words, they are telling you how to change the value of the weights in order to get the desired output, instead of the produced one.
Update the weights according to those gradients and the desired learning rate.
Perform another forward pass with different training examples, repeat the following steps until the error stops decreasing.
Starting with random weights is not a problem for the backpropagation algorithm, given enough training data and iterations it will tune the weights until they work for the given task.
I really encourage you to follow the full tutorial I linked, because you'll get a very detalied view of how and why backpropagation works for multi layered neural networks.

Graphically, how does the non-linear activation function project the input onto the classification space?

I am finding a very hard time to visualize how the activation function actually manages to classify non-linearly separable training data sets.
Why does the activation function (e.g tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to output? What separates training samples of different classes, and how does this work if one had to plot this process graphically?
I've tried looking for numerous sources, but what exactly makes the activation function actually work for classifying training samples in a neural network, I just cannot grasp easily and would like to be able to picture this in my mind.
Mathematical result behind neural networks is Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth almost-piecewise-constant approximators. The more neurons you have – the better your approximation is.
This picture was taked from this article: A visual proof that neural nets can compute any function. Make sure to check that article, it has other examples and interactive applets.
NNs actually, at each level, create new features by distorting input space. Non-linear functions allow you to change "curvature" of target function, so further layers have chance to make it linear-separable. If there were no non-linear functions, any combination of linear function is still linear, thus no benefit from multi-layerness. As a graphical example consider
this animation
This pictures where taken from this article. Also check out that cool visualization applet.
Activation functions have very little to do with classifying non-linearly separable sets of data.
Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
You can think of this as a way of enforcing equality across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known before hand it's not as simple as regular feature scaling (for linear regression) and thusly activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.
Back-propagation is a process that applies error to the neural network. You can think of this as a positive reward for the neurons that contributed to the correct classification and a negative reward for the neurons that contributed to an incorrect classification. This contribution is often known as the gradient of the neural network. The gradient is, effectively, a multi-variable derivative.
When back-propagating the error, each individual neuron's contribution to the gradient is the activations function's derivative at the input value for that neuron. Sigmoid is a particularly interesting function because its derivative is extremely cheap to compute. Specifically s'(x) = 1 - s(x); it was designed this way.
Here is an example image (found by google image searching: neural network classification) that demonstrates how a neural network might be superimposed on top of your data set:
I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.

Resources