Sharing weights in parallel Convolutional Layer - niftynet

I am currently developing a new network using NiftyNet and could use some help.
I am trying to implement the Autofocus Layer [1] as proposed in the paper. At a certain point, the Autofocus Layer needs to compute K (K=4) parallel convolutions, each using the same weights (w), and then concatenates the four outputs.
Is there a way to create four parallel convolutional layers in NiftyNet, each sharing the same weights?
Thank you in advance.
[1] https://arxiv.org/pdf/1805.08403.pdf

The solution to this problem is as follows.
There is nothing preventing you from using the same convolutional layer multiple times, each time with a different input. This simulates the desired parallelism and solves the weight-sharing issue, because there is only one convolutional layer.
However, this approach alone does not give each parallel branch a different dilation rate - we only have the one convolutional layer, for the weight-sharing reason mentioned above.
Note: feeding a given tensor into a convolutional layer with dilation rate = 2 is the same operation as feeding a tensor dilated with rate = 2 into a convolutional layer with dilation rate = 1.
Therefore, creating K dilated tensors, each with a different dilation rate, and feeding each of them into a single convolutional layer with dilation rate = 1 solves the problem of having parallel layers with different dilation rates.
NiftyNet provides a class to create dilated tensors.
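To make the weight-sharing pattern concrete, here is a minimal sketch in plain TensorFlow (not NiftyNet's own layer classes): one shared kernel is applied K times with different dilation rates and the outputs are concatenated, as described above. The kernel shape, the dilation rates and the use of 2-D instead of 3-D convolutions are illustrative assumptions, not the paper's exact configuration.

import tensorflow as tf

def parallel_shared_convs(x, kernel, dilation_rates=(1, 2, 4, 8)):
    """Apply the same kernel with K different dilation rates and concatenate."""
    outputs = []
    for rate in dilation_rates:
        # The single shared `kernel` variable gives the weight sharing across branches.
        y = tf.nn.conv2d(x, kernel, strides=1, padding='SAME', dilations=rate)
        outputs.append(y)
    return tf.concat(outputs, axis=-1)   # concatenate along the channel axis

# Example usage with a hypothetical 3x3 kernel mapping 16 -> 32 channels.
x = tf.random.normal([1, 64, 64, 16])
kernel = tf.Variable(tf.random.normal([3, 3, 16, 32]))
out = parallel_shared_convs(x, kernel)   # shape: [1, 64, 64, 4 * 32]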

Related

Can identity NNs do edge detection for grayscale images?

Not CNNs, regular NNs. Also, I'm actually interested in making an AI-based edge detector. I've read some papers, but none seem to kick-start me. Can anyone share some getting-started tips for making edge detectors with AI? CNNs work as classifiers, not image filters. So how can I do this?
The back propagation technique for neural networks is a popular technique mainly used for classification. During back propagation a convolution matrix is learned, and that is what actually extracts the edges from a gray-level image.
But I have another question: what kind of learning are you opting for to train your NN, supervised or unsupervised?
Supervised: train the network on a given data set whose targets are the edges.
Unsupervised: create an input layer with 5 inputs, subtract the central pixel from its four neighbouring pixels, and apply a threshold at the output layer (see the sketch after this answer).
You can even go for a hybrid neuro-fuzzy approach:
Sobel and Laplacian operators are applied to the given input image.
Fuzzy rules are applied to the output of these operators.
In the neural network, the input layer consists of the gradient directions and the hidden layer consists of the fuzzy data; both are used to train the network.
Hope it helps.
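As a minimal sketch of the unsupervised suggestion above: a single fixed-weight linear unit (a Laplacian-style kernel that compares the centre pixel with its four neighbours) applied at every pixel, followed by a threshold. No training is involved; the kernel and the threshold value are illustrative choices.

import numpy as np

def laplacian_edges(img, threshold=0.1):
    """img: 2-D grayscale array scaled to [0, 1]; returns a binary edge map."""
    kernel = np.array([[0, -1,  0],
                       [-1,  4, -1],
                       [0, -1,  0]], dtype=float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            # One "neuron": weighted sum of the 3x3 neighbourhood with fixed weights.
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return (np.abs(out) > threshold).astype(float)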

Caffe: what will happen if two layers backprop gradients to the same bottom blob?

I'm wondering what happens if I have a layer generating a bottom blob that is further consumed by two subsequent layers, both of which will generate some gradients to fill bottom.diff in the back propagation stage. Will both gradients be added up to form the final gradient? Or will only one of them survive? In my understanding, Caffe layers need to memset bottom.diff to all zeros before filling it with computed gradients, right? Will the memset flush out the gradients already computed by the other layer? Thank you!
Using more than a single loss layer is not out-of-the-ordinary, see GoogLeNet for example: It has three loss layers "pushing" gradients at different depths of the net.
In caffe, each loss layer has an associated loss_weight: how much this particular component contributes to the overall loss function of the net. Thus, if your net has two loss layers, Loss1 and Loss2, the overall loss of your net is
Loss = loss_weight1*Loss1 + loss_weight2*Loss2
Backpropagation uses the chain rule to propagate the gradient of Loss (the overall loss) through all the layers in the net. The chain rule breaks the derivative of Loss down into partial derivatives, i.e., the derivatives of each layer; the overall effect is obtained by propagating the gradients through these partial derivatives. In particular, when one blob feeds several layers, their gradient contributions to bottom.diff are summed, as the chain rule requires (a small numeric sketch follows below). That is, by using top.diff and the layer's backward() function to compute bottom.diff, one takes into account not only the layer's own derivative, but also the effect of ALL higher layers expressed in top.diff.
TL;DR
You can have multiple loss layers. Caffe (as well as any other decent deep learning framework) handles it seamlessly for you.
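As a small numeric sketch (plain NumPy, not Caffe code) of why the two contributions to a shared blob's gradient add up: for Loss = loss_weight1*Loss1(x) + loss_weight2*Loss2(x), the chain rule gives dLoss/dx = loss_weight1*dLoss1/dx + loss_weight2*dLoss2/dx. The toy losses and weights below are made up for illustration.

import numpy as np

x = np.array([1.0, -2.0, 3.0])       # the shared "bottom" blob
w1, w2 = 1.0, 0.5                    # loss_weight of each loss layer

# Two toy losses consuming the same blob.
L1 = np.sum(x ** 2)                  # dL1/dx = 2 * x
L2 = np.sum(np.abs(x))               # dL2/dx = sign(x)

grad_from_L1 = w1 * 2 * x
grad_from_L2 = w2 * np.sign(x)

# This is what ends up in bottom.diff: the sum, not either one alone.
bottom_diff = grad_from_L1 + grad_from_L2
print(bottom_diff)                   # [ 2.5 -4.5  6.5]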

Can't understand how filters in a Conv net are calculated

I've been studying machine learning for 4 months, and I understand the concepts behind the MLP. The problem came when I started reading about Convolutional Neural Networks. Let me tell you what I know and then ask what I'm having trouble with.
The core parts of a CNN are:
Convolutional Layer: you have "n" number of filters that you use to generate "n" feature maps.
RELU Layer: you use it for normalizing the output of the convolutional layer.
Sub-sampling Layer: used for "generating" a new feature map that represents more abstract concepts.
Repeat the first 3 layers a few times, and the last part is a common classifier, such as an MLP.
My doubts are the following:
How do I create the filters used in the Convolutional Layer? Do I have to create a filter, train it, and then put it in the Conv Layer, or do I train it with the backpropagation algorithm?
Imagine I have a conv layer with 3 filters, then it will output 3 feature maps. After applying the RELU and Sub-sampling layer, I will still have 3 feature maps (smaller ones). When passing again through the Conv Layer, how do I calculate the output? Do I have to apply the filter in each feature map separately, or do some kind of operation over the 3 feature maps and then make the sum? I don't have any idea of how to calculate the output of this second Conv Layer, and how many feature maps it will output.
How do I pass the data from the Conv layers to the MLP (for classification in the last part of the NN)?
If someone knows of a simple implementation of a CNN without using a framework, I would appreciate it. I think the best way of learning how stuff works is by doing it yourself. Later, once you already know how things work, you can use frameworks, because they save you a lot of time.
You train it with the backpropagation algorithm, the same way as you train an MLP.
You apply each filter across all the feature maps. For example, if you have 10 feature maps in the first layer and a filter in the second layer has a 3*3 spatial shape, then that filter is applied to all ten feature maps at once; the weights for each feature map are different, so that one filter has 3*3*10 weights (see the sketch below).
To understand this more easily, keep in mind that a pixel of a non-grayscale image has three values - red, green and blue - so if you pass images to a convolutional neural network, the input layer already has 3 feature maps (for RGB), and one value in the next layer is connected to all 3 feature maps in the first layer.
You should flatten the convolutional feature maps. For example, if you have 10 feature maps of size 5*5, you get a layer with 250 values, and from there on it is nothing different from an MLP: you connect all of these artificial neurons to all of the artificial neurons in the next layer by weights.
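Here is a small NumPy sketch of the second point above: one second-layer filter spans all 10 input feature maps, so it has 3*3*10 weights and produces a single output feature map. The map sizes are made-up example values.

import numpy as np

n_maps, H, W = 10, 8, 8
inputs = np.random.randn(n_maps, H, W)      # the 10 feature maps from layer 1
filt = np.random.randn(n_maps, 3, 3)        # one filter: 3*3*10 = 90 weights

out = np.zeros((H - 2, W - 2))              # ONE output feature map ("valid" convolution)
for i in range(H - 2):
    for j in range(W - 2):
        # Multiply-and-sum over ALL input maps at once, not each map separately.
        out[i, j] = np.sum(inputs[:, i:i + 3, j:j + 3] * filt)

# With, say, 16 such filters the second conv layer would output 16 feature maps.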
Here someone has implemented convolutional neural network without frameworks.
I would also recommend those lectures.

Backpropagation in Gradient Descent for Neural Networks vs. Linear Regression

I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process used to solve Linear Models (e.g. Regression)? Andrew Ng's excellent Machine Learning course on Coursera does exactly that for Linear Regression.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function, and if not, why is it only referenced in the case of Neural Nets and not for GLMs (Generalized Linear Models)? They all seem to be doing the same thing - what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative, and you'll encapsulate most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, having multiple hidden layers is then equivalent to a single hidden layer: it's still just linear combinations from input to output.
A "modern" NN has non-linear layers: ReLUs (which change negative values to 0), pooling (the max, min, or mean of several values), dropout (randomly removing some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the same principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x if x > 0,
0 otherwise
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
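As a tiny NumPy sketch of the three non-linear pieces just described (ReLU, 2x2 max pooling, and a dropout mask), on a made-up 4x4 input:

import numpy as np

x = np.array([[ 1.0, -2.0,  3.0, -4.0],
              [ 5.0, -6.0,  7.0,  8.0],
              [-1.0,  2.0, -3.0,  4.0],
              [ 0.5, -0.5,  1.5, -1.5]])

relu = np.maximum(x, 0.0)                        # f(x) = x if x > 0, 0 otherwise

# 2x2 max pooling: keep only the largest value in each 2x2 square.
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))

# Dropout: randomly zero out values (here with probability 0.5).
mask = (np.random.rand(*pooled.shape) > 0.5).astype(float)
dropped = pooled * mask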
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets
I think (at least originally) back propagation of errors meant less than what you describe: the term "backpropagation of errors" only referred to the method of calculating the derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, no matter what the gradient was then used for (e.g. Gradient Descent, or maybe Levenberg-Marquardt).
They all seem to be doing the same thing- what might I be missing?
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
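A tiny NumPy sketch of that last point: a network whose layers are all linear collapses to a single linear map, so it is no more expressive than a linear model. The layer sizes are arbitrary.

import numpy as np

x = np.random.randn(4)          # input
W1 = np.random.randn(5, 4)      # "hidden layer" weights
W2 = np.random.randn(3, 5)      # "output layer" weights

two_layer = W2 @ (W1 @ x)       # forward pass through two linear layers
one_layer = (W2 @ W1) @ x       # a single equivalent linear layer

print(np.allclose(two_layer, one_layer))   # True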

Do convolutional neural networks suffer from the vanishing gradient?

I think I read somewhere that convolutional neural networks do not suffer from the vanishing gradient problem as much as standard sigmoid neural networks with increasing number of layers. But I have not been able to find a 'why'.
Does it truly not suffer from the problem or am I wrong and it depends on the activation function?
[I have been using Rectified Linear Units, so I have never tested the Sigmoid Units for Convolutional Neural Networks]
Convolutional neural networks (like standard sigmoid neural networks) do suffer from the vanishing gradient problem. The most recommended approaches to overcome the vanishing gradient problem are:
Layerwise pre-training
Choice of the activation function
You may see that the state-of-the-art deep neural networks for computer vision problems (like the ImageNet winners) use convolutional layers as the first few layers of their network, but this is not the key to solving the vanishing gradient; the key is usually training the network greedily, layer by layer. Using convolutional layers has several other important benefits, of course. Especially in vision problems, where the input size is large (the pixels of an image), using convolutional layers for the first layers is recommended because they have fewer parameters than fully-connected layers and you don't end up with billions of parameters for the first layer (which would make your network prone to overfitting).
However, it has been shown (as in this paper) for several tasks that using rectified linear units alleviates the problem of vanishing gradients (as opposed to conventional sigmoid functions).
Recent advances have alleviated the effects of vanishing gradients in deep neural networks. Contributing advances include:
Usage of GPU for training deep neural networks
Usage of better activation functions. (At this point, rectified linear units (ReLU) seem to work best.)
With these advances, deep neural networks can be trained even without layerwise pretraining.
Source:
http://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/
We do not use Sigmoid and Tanh as activation functions, because they cause vanishing gradient problems. Nowadays we mostly use ReLU-based activation functions when training a deep neural network model, to avoid such complications and improve accuracy.
It's because the gradient or slope of the ReLU activation is 1 whenever its input is over 0. The sigmoid derivative has a maximum slope of 0.25, which means that during the backward pass you are multiplying gradients by values less than 1, and the more layers you have, the more such values less than 1 get multiplied together, making the gradients smaller and smaller. The ReLU activation solves this by having a gradient slope of 1, so during backpropagation the gradients passed back do not get progressively smaller; they stay the same, which is how ReLU solves the vanishing gradient problem.
One thing to note about ReLU, however, is that if a value is less than 0 that neuron is dead, and the gradient passed back is 0, meaning that no gradient is propagated back for values less than 0.
An alternative is Leaky RELU, which gives some gradient for values less than 0.
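A small NumPy sketch of the "multiplying values less than 1" effect described above, for a made-up depth of 20 layers: the product of sigmoid derivatives (at most 0.25 each) shrinks exponentially, while the ReLU derivative for positive inputs stays at 1.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers = 20
z = 0.0                                        # the best case for sigmoid
sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))   # 0.25, its maximum slope
relu_grad = 1.0                                # derivative of ReLU for z > 0

print(sigmoid_grad ** n_layers)    # ~9.1e-13: vanishing
print(relu_grad ** n_layers)       # 1.0: unchanged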
The first answer is from 2015 and shows its age a bit.
Today, CNNs typically also use batch norm, though there is some debate about why it helps: the inventors mention covariate shift: https://arxiv.org/abs/1502.03167
There are other theories like smoothing the loss landscape: https://arxiv.org/abs/1805.11604
Either way, it is a method that significantly helps with the vanishing/exploding gradient problem, which is also relevant for CNNs. In CNNs you also apply the chain rule to get gradients, so the update of the first layer is proportional to a product of N numbers, where N is roughly the number of layers between it and the loss. It is very likely that this product is either relatively big or relatively small compared to the update of the last layer. This can be seen by looking at the variance of a product of random variables, which grows quickly the more variables are multiplied: https://stats.stackexchange.com/questions/52646/variance-of-product-of-multiple-random-variables
For recurrent networks with long input sequences, i.e. of length L, the situation is often worse than for CNNs, since there the product consists of L numbers. The sequence length L in an RNN is often much larger than the number of layers N in a CNN.
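A small NumPy sketch of that last point: a product of many random factors is typically either very small or very large, which is the vanishing/exploding gradient effect seen through the chain rule. The distribution and the value N = 50 are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 50                                    # think: number of factors in the chain-rule product
factors = rng.lognormal(mean=0.0, sigma=0.5, size=(10_000, N))
products = factors.prod(axis=1)

print(np.median(products))                # close to 1
print(products.min(), products.max())     # spread over many orders of magnitude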
