I am looking through the code in the library. In the paper (page 6, second column, first paragraph) it is stated that the convolutional layers are kept fixed during the third and fourth training steps, while the RPN layers and the Fast R-CNN layers are tuned.
Which portion of the code takes care of this?
I looked at the code, and Solver.cpp is the one controlling Forward/Backward, but I don't see any implementation of fixing the convolutional layers there.
Also, all the prototxt files have similar definitions for the layers.
How is this fixing of the convolutional layers during training implemented?
When freezing a layer during fine-tuning, one usually sets
param { lr_mult: 0 }
for every parameter blob of that layer (a convolution layer has two param blocks: one for the weights and one for the bias). This way Caffe does not update that layer's parameters during training.
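Below is a rough sketch of how one could apply this to a whole training prototxt from Python, using Caffe's protobuf definitions. The file names train.prototxt / train_frozen.prototxt are placeholders, and freezing every Convolution layer is just one example of "fixing the conv layers":

    # Sketch: set lr_mult to 0 for all Convolution layers of a train prototxt.
    # File names are placeholders; adapt to your own net definition.
    from caffe.proto import caffe_pb2
    from google.protobuf import text_format

    net = caffe_pb2.NetParameter()
    with open('train.prototxt') as f:
        text_format.Merge(f.read(), net)

    for layer in net.layer:
        if layer.type == 'Convolution':
            del layer.param[:]                 # drop any existing param specs
            for _ in range(2):                 # one spec for weights, one for bias
                p = layer.param.add()
                p.lr_mult = 0                  # no weight update for this blob
                p.decay_mult = 0               # and no weight decay either

    with open('train_frozen.prototxt', 'w') as f:
        f.write(text_format.MessageToString(net))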
If I understand correctly, one neuron per layer is enough, since the layer will just be unrolled through time to accommodate a long sequence.
How can a recurrent layer contain several neurons?
Aren't the neurons in one layer essentially the same if they were unrolled through time?
Neural networks (MLP, CNN, RNN) are expected to have multiple neurons across multiple layers. One neuron per layer is hardly enough: it will most likely give you something close to a linear solution, and such an architecture is too small to deal with any kind of real-life situation.
From Brandon Rohrer's video Recurrent Neural Networks and LSTM you can see a very simple structure containing multiple neurons (dots) in a single layer. Imagine this simple model working with only one neuron per layer: it would perform very poorly.
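To make the "several neurons per recurrent layer" point concrete, here is a minimal plain-numpy sketch (all sizes are made up). The same weight matrices are reused at every time step when the layer is unrolled, which is what makes the unrolled copies "the same", but within each step the layer still computes a whole hidden-state vector, one entry per neuron:

    import numpy as np

    # Toy vanilla RNN layer: hidden_size = 4 "neurons", unrolled over T steps.
    T, input_size, hidden_size = 5, 3, 4
    rng = np.random.default_rng(0)
    W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input-to-hidden
    W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden-to-hidden
    b = np.zeros(hidden_size)

    x = rng.standard_normal((T, input_size))   # a toy input sequence
    h = np.zeros(hidden_size)
    for t in range(T):
        # The SAME W_xh, W_hh, b are used at every step (weight sharing through time),
        # but all 4 hidden units are updated together at each step.
        h = np.tanh(W_xh @ x[t] + W_hh @ h + b)

    print(h.shape)   # (4,) -- the activations of the layer's four neurons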
Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models; the images should be classified into 40 classes. I want to use the VGG and Xception pre-trained models to convert each image into two 1000-dimensional vectors and stack them into a 1x2000-dimensional vector as the input of my network, which has a 40-dimensional output. The network has 2 hidden layers, one with 1024 neurons and the other with 512 neurons.
Structure:
image -> vgg (1x1000 dimensions), xception (1x1000 dimensions) -> (1x2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40-dimensional output -> softmax
However, using this structure I can only achieve about 30% accuracy. So my question is: how could I optimize the structure of my network to achieve higher accuracy? I'm new to deep learning, so I'm not quite sure my current design is 'correct'. I'm really looking forward to your advice.
I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (either VGG or Xception will do; you do not need both), remove the last fully-connected layer (this layer's outputs are the 1000 class scores for a different task, namely ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, for an AlexNet this would compute a 4096-D vector for every image, containing the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. a linear SVM or a softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers in your own classifier (see the sketch right after these tips).
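A minimal sketch of this fixed-feature-extractor setup in Keras (my assumption about the framework, since the question does not say which one is used). VGG16 from keras.applications stands in for "the pretrained network", and x / y are placeholders for your images and integer labels in [0, 40):

    import numpy as np
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras import layers, models

    # Convolutional base as a fixed feature extractor:
    # include_top=False drops the 1000-way ImageNet classifier head.
    base = VGG16(weights='imagenet', include_top=False, pooling='avg',
                 input_shape=(224, 224, 3))
    base.trainable = False                      # keep the pretrained weights fixed

    model = models.Sequential([
        base,
        layers.Dense(40, activation='softmax'), # one linear classifier on top
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model.fit(x, y, epochs=10, validation_split=0.1)   # x: images, y: labels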
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful for many tasks, while later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning, because you don't want to distort the pretrained weights too quickly or too much (a fine-tuning sketch follows).
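Continuing the Keras sketch above (again an assumption, not the asker's actual code): unfreeze only the last few layers of the base and recompile with a small learning rate. Unfreezing the last 4 layers is an arbitrary illustrative choice:

    from tensorflow.keras.optimizers import Adam

    base.trainable = True
    for layer in base.layers[:-4]:      # keep the earlier, more generic layers fixed
        layer.trainable = False

    model.compile(optimizer=Adam(learning_rate=1e-5),   # small LR for fine-tuning
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model.fit(x, y, epochs=5, validation_split=0.1)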
This architecture much more closely resembles the ones I have seen solving the same kind of problem and has a better chance of reaching high accuracy.
There are a couple of steps you may try when the model is not fitting well:
Increase training time and decrease learning rate. It may be stopping at very bad local optima.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks, one per class (with a 'yes'/'no' output). This lets each network specialize in its own class, rather than training one single network to learn all 40 classes (see the sketch after this list).
Increase the number of training samples.
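A rough one-vs-rest sketch of the third suggestion, again assuming Keras; features and labels below are random placeholders standing in for your extracted feature vectors and class ids:

    import numpy as np
    from tensorflow.keras import layers, models

    rng = np.random.default_rng(0)
    features = rng.standard_normal((200, 512)).astype('float32')  # placeholder features
    labels = rng.integers(0, 40, size=200)                        # placeholder class ids

    binary_models = []
    for c in range(40):
        y_c = (labels == c).astype('float32')      # 'yes'/'no' target for class c
        m = models.Sequential([
            layers.Dense(64, activation='relu', input_shape=(features.shape[1],)),
            layers.Dense(1, activation='sigmoid'),
        ])
        m.compile(optimizer='adam', loss='binary_crossentropy')
        m.fit(features, y_c, epochs=3, verbose=0)
        binary_models.append(m)

    # Predict by taking the class whose network gives the highest 'yes' score.
    scores = np.hstack([m.predict(features, verbose=0) for m in binary_models])
    pred = scores.argmax(axis=1)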
I'm wondering: what if I have a layer generating a bottom blob that is further consumed by two subsequent layers, both of which will generate some gradients to fill bottom.diff in the backpropagation stage? Will the two gradients be added up to form the final gradient, or can only one of them survive? In my understanding, Caffe layers need to memset bottom.diff to all zeros before filling it with the computed gradients, right? Will that memset wipe out the gradient already computed by the other layer? Thank you!
Using more than a single loss layer is not out of the ordinary; see GoogLeNet for example: it has three loss layers "pushing" gradients at different depths of the net.
In Caffe, each loss layer has an associated loss_weight: how much this particular component contributes to the loss function of the net. Thus, if your net has two loss layers, Loss1 and Loss2, the overall loss of your net is
Loss = loss_weight1*Loss1 + loss_weight2*Loss2
Backpropagation uses the chain rule to propagate the gradient of Loss (the overall loss) through all the layers of the net. The chain rule breaks the derivative of Loss into partial derivatives, i.e., the derivatives of each layer; the overall effect is obtained by propagating the gradients through these partial derivatives. That is, by using top.diff and the layer's backward() function to compute bottom.diff, one takes into account not only the layer's own derivative, but also the effect of ALL higher layers, expressed in top.diff.
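Here is a tiny plain-Python check of this chain-rule bookkeeping for a value consumed by two branches (not Caffe code, just arithmetic): the gradient contributions of the two branches add up, neither overwrites the other.

    # Loss = loss_weight1*Loss1(x) + loss_weight2*Loss2(x)
    # with Loss1(x) = x^2 and Loss2(x) = 3x as toy branches.
    x = 2.0
    w1, w2 = 1.0, 0.5                          # the two loss_weights

    grad_from_branch1 = w1 * 2 * x             # d(w1*x^2)/dx
    grad_from_branch2 = w2 * 3                 # d(w2*3x)/dx
    grad_x = grad_from_branch1 + grad_from_branch2
    print(grad_x)                              # 5.5

    # Finite-difference check of the same derivative:
    loss = lambda t: w1 * t**2 + w2 * 3 * t
    eps = 1e-6
    print((loss(x + eps) - loss(x - eps)) / (2 * eps))   # ~5.5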
TL;DR
You can have multiple loss layers; Caffe (like any other decent deep learning framework) handles it seamlessly for you. When a blob is consumed by more than one layer, Caffe automatically inserts a Split layer behind the scenes, and during backpropagation the gradients coming from all consumers are summed into the blob's diff, so nothing gets overwritten.
I am using the Caffe deep learning framework for image classification.
I have coins with faces. Some of them face left and some face right.
To classify them I am using the common approach: take the weights and structure from a pretrained ImageNet network that has already captured a lot of image patterns, and train mostly the last layer to fit my training set.
But I have found that the network does not work on this set:
I took some coin, for example a left-facing one, generated a horizontally flipped image from it, and labeled it as right-facing.
For this set the convolutional net gets ~50% accuracy, which is exactly a random result.
I have also tried to train the net on 2 images (2 flipped versions of the letter "h"), but with the same result: 50%. (If I choose two different letters and train the net on the augmented dataset, I reach 100% accuracy very fast.) But the invariance to flipping breaks my classification.
My question is: is there some approach that allows me to use the advantages of a pretrained ImageNet model but somehow break this invariance? And which layer of the net makes the invariance possible?
I am using "caffe" to generate the net, based on this example approach:
https://github.com/BVLC/caffe/blob/master/examples/02-fine-tuning.ipynb
Caffe's basic/baseline models trained on ImageNet mostly use a very simple image augmentation: flipping images horizontally. That is, ImageNet classes are indeed the same when flipped horizontally. Thus, the weights you are trying to fine-tune were trained in a setting where horizontal flips should be ignored, and I suppose what you see is a net that captured this quite well: it is no longer sensitive to this particular transformation.
It is not trivial to tell at which layer of the net this invariance arises, and therefore it is not easy to say which layers should be fine-tuned to overcome this behavior. I suspect this invariance is quite fundamental to the network, and I would not be surprised if overcoming it required re-training the entire net.
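One rough way to probe where the invariance kicks in is to feed an image and its horizontal mirror through the net and compare the activations layer by layer. A pycaffe sketch of that idea follows; deploy.prototxt, weights.caffemodel and coin.npy are placeholders for your own files, and keep in mind that at convolutional layers the feature maps of a mirrored input are themselves mirrored, so this comparison is only a crude indicator:

    import numpy as np
    import caffe

    net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)
    img = np.load('coin.npy')                 # preprocessed image, shape (3, H, W)
    flipped = img[:, :, ::-1]                 # horizontal mirror

    def activations(x):
        net.blobs['data'].reshape(1, *x.shape)
        net.blobs['data'].data[...] = x
        net.forward()
        return {name: blob.data.copy() for name, blob in net.blobs.items()}

    a, b = activations(img), activations(flipped)
    for name in net.blobs:
        diff = np.abs(a[name] - b[name]).mean()
        scale = np.abs(a[name]).mean() + 1e-8
        print(name, diff / scale)             # small ratio -> roughly flip-invariant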
I think I read somewhere that convolutional neural networks do not suffer from the vanishing gradient problem as much as standard sigmoid neural networks do as the number of layers increases. But I have not been able to find a 'why'.
Does it truly not suffer from the problem or am I wrong and it depends on the activation function?
[I have been using Rectified Linear Units, so I have never tested the Sigmoid Units for Convolutional Neural Networks]
Convolutional neural networks (like standard sigmoid neural networks) do suffer from the vanishing gradient problem. The most recommended approaches to overcome the vanishing gradient problem are:
Layerwise pre-training
Choice of the activation function
You may see that state-of-the-art deep neural networks for computer vision problems (like the ImageNet winners) use convolutional layers as the first few layers of their network, but this is not the key to solving the vanishing gradient; the key is usually training the network greedily, layer by layer. Using convolutional layers has several other important benefits, of course. Especially in vision problems, where the input size is large (the pixels of an image), using convolutional layers for the first layers is recommended because they have fewer parameters than fully-connected layers, so you don't end up with billions of parameters for the first layer (which would make your network prone to overfitting).
However, it has been shown (for example in this paper) for several tasks that using rectified linear units alleviates the problem of vanishing gradients (as opposed to conventional sigmoid functions).
Recent advances have alleviated the effects of vanishing gradients in deep neural networks. Contributing advances include:
Usage of GPU for training deep neural networks
Usage of better activation functions. (At this point rectified linear units (ReLU) seem to work best.)
With these advances, deep neural networks can be trained even without layerwise pretraining.
Source:
http://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/
We do not use sigmoid and tanh as activation functions, because they cause vanishing gradient problems. Nowadays we mostly use ReLU-based activation functions when training a deep neural network model, to avoid such complications and improve accuracy.
This is because the gradient (slope) of the ReLU activation is 1 whenever its input is above 0. The sigmoid derivative has a maximum slope of 0.25, which means that during the backward pass you are multiplying gradients by values less than 1; the more layers you have, the more of these sub-1 factors you multiply together, making the gradients smaller and smaller. ReLU solves this by having a gradient slope of 1, so during backpropagation the gradients passed back do not get progressively smaller; they stay the same, which is how ReLU avoids the vanishing gradient problem.
One thing to note about ReLU, however, is that if a unit's input is less than 0, that neuron is dead and the gradient passed back through it is 0, so during backpropagation no gradient flows through units whose input was below 0.
An alternative is leaky ReLU, which gives some gradient for values less than 0 (a small numeric illustration follows).
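A small numpy illustration of these slopes (toy numbers, one pre-activation per layer), showing how the per-layer derivative factors multiply up over depth for sigmoid, ReLU, and leaky ReLU:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal(50)                     # one pre-activation per layer

    sig = 1 / (1 + np.exp(-z))
    sigmoid_factors = sig * (1 - sig)               # each factor is at most 0.25
    relu_factors = (z > 0).astype(float)            # 1 if z > 0, else 0 (dead unit)
    leaky_factors = np.where(z > 0, 1.0, 0.01)      # small slope below 0

    print(np.prod(sigmoid_factors))   # shrinks toward 0 as depth grows
    print(np.prod(relu_factors))      # 0 if any unit on the path is dead, else 1
    print(np.prod(leaky_factors))     # small but nonzero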
The first answer is from 2015 and showing its age a bit.
Today, CNNs typically also use batch normalization. There is some debate about why this helps: the inventors mention covariate shift: https://arxiv.org/abs/1502.03167
There are other theories, like smoothing of the loss landscape: https://arxiv.org/abs/1805.11604
Either way, it is a method that helps significantly with the vanishing/exploding gradient problem, which is also relevant for CNNs. In CNNs you also apply the chain rule to get gradients, so the update of the first layer is proportional to the product of N numbers, where N is the number of layers. It is very likely that this product is either relatively big or relatively small compared to the update of the last layer. This can be seen by looking at the variance of a product of random variables, which grows quickly the more variables are multiplied: https://stats.stackexchange.com/questions/52646/variance-of-product-of-multiple-random-variables
For recurrent networks with long input sequences, i.e. of length L, the situation is often worse than for CNNs, since there the product consists of L numbers. Often the sequence length L in an RNN is much larger than the number of layers N in a CNN.
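The "variance of a product" point above can be checked numerically with a small numpy experiment (toy per-layer factors with mean 1; the exact distribution is my choice for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (2, 8, 32):                        # number of multiplied factors (depth)
        factors = 1 + 0.5 * rng.standard_normal((100000, n))
        prods = factors.prod(axis=1)
        # Variance grows roughly like 1.25**n - 1, so long products are either
        # very large or very small compared to a single factor.
        print(n, prods.var())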