When should you update weights in neural network using backpropagation? - machine-learning

Let's say I have a 3-layer fully-connected neural network, and I am implementing the backpropagation algorithm. My question is: should I first calculate the deltas and update the weights only after backpropagation is done, or should I update them as I backpropagate through the layers? I have seen both ways in internet tutorials.
I'm not sure, because if I update the weights during backpropagation, I use the newly updated weights (hidden-to-output weights) to calculate the hidden layer deltas, and I'm not sure if this is desired.
Sorry if I used incorrect terminology, I am new to this and trying to learn.

The classical approach is to compute all the deltas first, using the weights from the forward pass, and then update all the weights simultaneously, as a single operation. This may lead to the so-called internal covariate shift (the later layers are updated assuming the old weights of the earlier layers), but that is where batch normalization helps.
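To make the ordering concrete, here is a minimal NumPy sketch of one training step. It assumes sigmoid activations, a squared-error loss and no biases; the function name `train_step` and the layer sizes are made up for illustration. The point is only that both deltas are computed with the old weights before any weight changes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, W2, lr=0.1):
    """One gradient-descent step for an input-hidden-output network.

    Loss is 0.5 * ||out - y||^2. All deltas are computed with the weights
    from before the step; only then are both matrices updated.
    """
    # Forward pass
    h = sigmoid(W1 @ x)      # hidden activations
    out = sigmoid(W2 @ h)    # output activations

    # Backward pass: the hidden deltas use the OLD W2
    delta_out = (out - y) * out * (1.0 - out)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)

    # Gradients of the loss with respect to each weight matrix
    grad_W2 = np.outer(delta_out, h)
    grad_W1 = np.outer(delta_hid, x)

    # Simultaneous update, after backpropagation has finished
    return W1 - lr * grad_W1, W2 - lr * grad_W2

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = np.array([0.0, 1.0])
W1 = rng.normal(size=(4, 3))   # input -> hidden
W2 = rng.normal(size=(2, 4))   # hidden -> output
new_W1, new_W2 = train_step(x, y, W1, W2)
```

If `W2` were overwritten before `delta_hid` is computed, the hidden deltas would no longer be the gradient of the loss that the forward pass actually evaluated.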

Related

How should I optimize neural network for image classification using pretrained models

Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models; the images should be classified into 40 classes. I want to use the VGG and Xception pre-trained models to convert each image into two 1000-dimensional vectors and stack them into a 1*2000-dimensional vector as the input of my network, which has a 40-dimensional output. The network has 2 hidden layers, one with 1024 neurons and the other with 512 neurons.
Structure:
image-> vgg(1*1000 dimensions), xception(1*1000 dimensions)->(1*2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40 dimension output -> softmax
However, using this structure I can only achieve about 30% accuracy. So my question is: how could I optimize the structure of my network to achieve higher accuracy? I'm new to deep learning, so I'm not quite sure my current design is 'correct'. I'm really looking forward to your advice.
I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (either VGG or Xception will do; you do not need both), remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, in an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers for your own classifier.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning because you don't want to distort other pretrained layers too quickly and too much.
This architecture much more closely resembles the ones I have seen solve the same problem, and it has a better chance of reaching high accuracy.
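As a rough sketch of the first scenario (fixed features plus a single softmax classifier, with no hidden layers), the following trains the classifier in plain NumPy. The feature matrix here is synthetic and only stands in for codes extracted from a frozen pretrained network; `train_softmax`, the sizes and the learning rate are illustrative assumptions, not anything from the question.

```python
import numpy as np

def train_softmax(features, labels, num_classes, lr=0.01, epochs=200):
    """Train one linear layer + softmax on fixed, pre-extracted features."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # d(mean cross-entropy)/d(logits)
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# Hypothetical stand-in for features produced by a frozen pretrained ConvNet
rng = np.random.default_rng(0)
num_classes, dim = 4, 16
centers = rng.normal(scale=3.0, size=(num_classes, dim))
labels = rng.integers(0, num_classes, size=200)
features = centers[labels] + rng.normal(size=(200, dim))

W, b = train_softmax(features, labels, num_classes)
accuracy = np.mean(np.argmax(features @ W + b, axis=1) == labels)
```

Only `W` and `b` are trained here; the extractor that produced `features` is never touched, which is what "fixed feature extractor" means.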
There are a couple of steps you may try when the model is not fitting well:
Increase the training time and decrease the learning rate. Training may be stopping at a very bad local optimum.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks for each class ('yes' or 'no' output class). This will let each network be more specialized for each class, rather than training one single network to learn all 40 classes.
Increase training samples.

Is training a neural network with keep probability 1 the same as training it without dropout?

This seems to be a more theoretical question, and I hope someone knows the answer. I'm using TensorFlow to train a fully connected deep neural network. I apply dropout to my hidden layers, and I'm investigating dropout in some cases.
I know that dropout is only applied to the input and hidden layers, and that for evaluation of the network the keep probability should be 1.0. For the case where I want to train my network without dropout: can I just set the keep probability to 1 on the hidden layers for training, or do I have to remove it completely from my source code?
Greetings
You can keep your code as is. A keep probability of 1.0 is indeed equivalent to no dropout, as every activation is kept.
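A small NumPy sketch of inverted dropout shows why. This is an illustrative reimplementation, not TensorFlow's own code; the `dropout` helper and its arguments are assumptions made for the example.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob, rescale the kept ones."""
    # rng.random() draws from [0, 1), so the mask is all True when keep_prob == 1.0
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.arange(1.0, 7.0).reshape(2, 3)

same = dropout(activations, keep_prob=1.0, rng=rng)      # identical to the input
dropped = dropout(activations, keep_prob=0.5, rng=rng)   # each entry is 0 or doubled
```

With `keep_prob=1.0` the mask keeps everything and the rescaling divides by 1, so the layer is the identity.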

Backpropagation in Gradient Descent for Neural Networks vs. Linear Regression

I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process to solve for Linear Models (Regression for e.g.)? Andrew Ng's excellent course on Coursera for Machine Learning does exactly that for Linear Regression.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets and why not for GLMs (Generalized Linear Models). They all seem to be doing the same thing- what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative, and you'll encapsulate most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, in such a linear network, having multiple hidden layers is equivalent to having a single hidden layer: it's still linear combinations from input to output.
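The collapse of stacked linear layers is easy to check numerically. In this sketch the three weight matrices are random placeholders with arbitrary sizes, and there are no biases or activation functions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))   # input -> hidden 1
W2 = rng.normal(size=(4, 5))   # hidden 1 -> hidden 2
W3 = rng.normal(size=(2, 4))   # hidden 2 -> output
x = rng.normal(size=3)

# Passing x through three purely linear layers...
deep = W3 @ (W2 @ (W1 @ x))

# ...gives the same result as one layer whose weights are the matrix product
W_single = W3 @ W2 @ W1
shallow = W_single @ x
```

Once a non-linearity sits between the layers, there is in general no single matrix that reproduces the whole network.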
A "modern" NN has non-linear layers: ReLUs (change negative values to 0), pooling (max, min, or mean of several values), dropout (randomly remove some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is f(x) = x if x > 0, and 0 otherwise.
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
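For reference, the ReLU and the 2x2 max pooling described above can be sketched in NumPy as follows. The helper names are made up, and the pooling assumes a 2-D input with even height and width:

```python
import numpy as np

def relu(x):
    """f(x) = x if x > 0, 0 otherwise."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Derivative used in the backward pass: 1 where x > 0, else 0."""
    return (x > 0).astype(x.dtype)

def max_pool_2x2(x):
    """Keep only the maximum of each non-overlapping 2x2 square."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[ 1., -2.,  3.,  0.],
              [-1.,  5., -4.,  2.],
              [ 0., -3.,  7., -1.],
              [ 4.,  1., -6.,  8.]])

activated = relu(x)
pooled = max_pool_2x2(activated)   # [[5., 3.], [4., 8.]]
```

The 4x4 input shrinks to 2x2, i.e. length and width are each reduced by a factor of 2.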
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets
I think (at least originally) backpropagation of errors meant less than what you describe: the term "backpropagation of errors" only referred to the method of calculating the derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, no matter what the gradient was then used for (e.g. gradient descent, or maybe Levenberg-Marquardt).
They all seem to be doing the same thing- what might I be missing?
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
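The view of backpropagation as just one way of computing derivatives can be illustrated by comparing it with numerical differentiation on a tiny network. This is a sketch under assumptions: one tanh hidden layer, a linear output, squared-error loss, and made-up function names.

```python
import numpy as np

def loss(W1, W2, x, y):
    h = np.tanh(W1 @ x)
    return 0.5 * np.sum((W2 @ h - y) ** 2)

def backprop_grad_W1(W1, W2, x, y):
    """dLoss/dW1 via backpropagation of errors (chain rule, layer by layer)."""
    h = np.tanh(W1 @ x)
    delta_out = W2 @ h - y                            # error at the linear output
    delta_hid = (W2.T @ delta_out) * (1.0 - h ** 2)   # propagated back through tanh
    return np.outer(delta_hid, x)

def numerical_grad_W1(W1, W2, x, y, eps=1e-6):
    """The same derivative via central finite differences, one weight at a time."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        grad[idx] = (loss(Wp, W2, x, y) - loss(Wm, W2, x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

analytic = backprop_grad_W1(W1, W2, x, y)
numeric = numerical_grad_W1(W1, W2, x, y)
```

Both produce the same gradient up to rounding; what an optimizer then does with it (gradient descent or something else) is a separate choice.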

Making neural net to draw an image (aka Google's inceptionism) using nolearn\lasagne

Probably lots of people already saw this article by Google research:
http://googleresearch.blogspot.ru/2015/06/inceptionism-going-deeper-into-neural.html
It describes how a Google team made neural networks actually draw pictures, like an artificial artist :)
I wanted to do something similar just to see how it works, and maybe use it in the future to better understand what makes my network fail. The question is: how do I achieve this with nolearn\lasagne (or maybe pybrain, which would also work, but I prefer nolearn)?
To be more specific, the people at Google have trained an ANN with some architecture to classify images (for example, to classify which fish is in a photo). Fine, suppose I have an ANN constructed in nolearn with some architecture, and I have trained it to some degree. But what do I do next? I don't get it from their article. It doesn't seem that they just visualize the weights of some specific layers. It seems to me (maybe I am wrong) that they do one of 2 things:
1) Feed some existing image or pure random noise to the trained network and visualize the activations of one of the neuron layers. But it looks like this is not the whole story, since if they used a convolutional neural network, the dimensionality of the layers might be lower than the dimensionality of the original image.
2) Or they feed random noise to the trained ANN, get its intermediate output from one of the middle layers and feed it back into the network, to get some kind of loop and inspect what the network's layers think might be out there in the random noise. But again, I might be wrong due to the same dimensionality issue as in #1.
So, any thoughts on that? How could we do something similar to what Google did in the original article using nolearn or pybrain?
From their ipython notebook on github:
Making the "dream" images is very simple. Essentially it is just a
gradient ascent process that tries to maximize the L2 norm of
activations of a particular DNN layer. Here are a few simple tricks
that we found useful for getting good images:
offset image by a random jitter
normalize the magnitude of gradient ascent steps
apply ascent across multiple scales (octaves)
It is done using a convolutional neural network. You are correct that the dimensions of the activations will be smaller than those of the original image, but this isn't a problem.
You change the image with iterations of forward/backward propagation, just as you would normally train a network. On the forward pass, you only need to go until you reach the particular layer you want to work with. Then on the backward pass, you propagate back to the inputs of the network instead of the weights.
So instead of finding the gradients of a loss function with respect to the weights, you are finding the gradients of the L2 norm of a certain set of activations with respect to the inputs.
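Here is a toy NumPy sketch of that idea, not the code from Google's notebook and not nolearn/lasagne: a single fixed ReLU layer stands in for "a particular layer" of a trained network, the image is a flat vector, and the weights `W` are random placeholders. Only the normalized gradient step from the list above is included; jitter and octaves are left out.

```python
import numpy as np

def objective(W, image):
    """Half the squared L2 norm of one ReLU layer's activations."""
    a = np.maximum(W @ image, 0.0)
    return 0.5 * np.sum(a ** 2)

def input_gradient(W, image):
    """Gradient of the objective with respect to the input image, not the weights."""
    a = np.maximum(W @ image, 0.0)
    # ReLU passes gradient only where a > 0, and a is already 0 elsewhere
    return W.T @ a

def dream(W, image, steps=20, step_size=0.1):
    """Gradient ascent on the image; the weights W stay fixed throughout."""
    image = image.copy()
    for _ in range(steps):
        g = input_gradient(W, image)
        # Normalize the step by the mean gradient magnitude
        image += step_size * g / (np.abs(g).mean() + 1e-8)
    return image

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # hypothetical fixed "trained" weights
start = rng.normal(size=16)    # flattened noise image
dreamed = dream(W, start)
```

The activations have 8 entries while the image has 16, yet the gradient that comes back has the shape of the image, which is why the dimensionality mismatch raised in the question is not an obstacle.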

How Sensitive Are FF Neural Networks?

CrossPost: https://stats.stackexchange.com/questions/103960/how-sensitive-are-neural-networks
I am aware of pruning, and am not sure if it removes the actual neuron or makes its weight zero, but I am asking this question as if a pruning process were not being used.
On variously sized feedforward neural networks on large datasets with lots of noise:
Is it possible one (or some trivial amount) extra OR missing hidden neurons OR hidden layers make or break a network? Or will its synapse weights simply degrade to zero if it is not necessary and compensate with the other neurons if it is missing one or two?
When experimenting, should input neurons be added one at a time or in groups of X? What is X? Increments of 5?
Lastly, should each hidden layer contain the same number of neurons? This is usually what I see in examples. If not, how and why would you adjust their sizes, other than by pure experimentation?
I would rather overdo it and wait longer for convergence, provided a larger network will adapt itself to the solution. I have tried numerous configurations, but it is still difficult to gauge an optimum one.
1) Yes, absolutely. For example, if you have too few neurons in your hidden layer, your model will be too simple and have high bias. Similarly, if you have too many neurons, your model will overfit and have high variance. Adding more hidden layers allows you to model very complex problems like object recognition, but there are a lot of tricks to making additional hidden layers work; this is the field known as deep learning.
2) In a single-layer neural network, it's generally a rule of thumb to start with 2 times as many neurons as the number of inputs. You can determine the increment through binary search, i.e. run through a few different architectures and see how the accuracy changes.
3) No, definitely not: each hidden layer can contain as many neurons as you want it to contain. There is no way other than experimentation to determine their sizes; everything you mention is a hyperparameter which you must tune.
I'm not sure if you are looking for a simple answer, but you may be interested in a new neural network regularization technique called dropout. Dropout basically randomly "removes" some of the neurons during training, forcing each of the neurons to be a good feature detector. It greatly reduces overfitting, and you can go ahead and set the number of neurons to be high without worrying too much. Check this paper out for more info: http://www.cs.toronto.edu/~nitish/msc_thesis.pdf
