Why is Gradient Checking Slow For Back Propagation? - machine-learning

I recently learned the algorithm "Gradient Checking" for making sure the derivatives of my Neural Network's Back Propagation are calculated properly.
The course from which I have learned, and many other sources such as this one, claim it to be much slower than calculating derivatives, but I can't seem to find anywhere that explains WHY.
So, why is gradient checking slower than calculating the derivative directly?
How much slower is it?

What you are doing in back-propagation is the backwards mode of automatic/algorithmic differentiation for a function that has a very large number N of inputs and only one output. The "inputs" here mean chiefly the real-number parameters of the nodes of the neural nets, possibly also the input variables of the net.
In the backwards mode you compute the derivatives of all inputs in one pass through the chain of operations. This has the cost of about 3 function evaluations plus the organizational overhead to execute the operations chain backwards and store and access the intermediate results.
In the forward mode for the same situation, which you use for the "gradient checking", independent of if you push forward the AD derivatives or compute divided differences, you will need to compute each derivative individually. The total cost of that is about 2*N function evaluations.
And as N is large, 2*N is much larger than 3.

Related

How to evolve weights of a neural network in Neuroevolution?

I'm new to Artificial Neural Networks and NeuroEvolution algorithms in general. I'm trying to implement the algorithm called NEAT (NeuroEvolution of Augmented Topologies), but the description in original public paper missed the method of how to evolve the weights of a network, it says
Connection weights mutate as in any NE system, with each connection either perturbed or not at each generation
I've done some searching about how to mutate weights in NE systems, but can't find any detailed description, unfortunately.
I know that while training a neural network, usually the backpropagation algorithm is used to correct the weights, but it only works if you have a fixed topology (structure) through generations and you know the answer to the problem. In NeuroEvolution, you don't know the answer, you have only the fitness function, so it's not possible to use backpropagation here.
I have some experience with training a fixed-topology NN using a genetic algorithm (What the paper refers to as the "traditional NE approach"). There are several different mutation and reproduction operators we used for this and we selected those randomly.
Given two parents, our reproduction operators (could also call these crossover operators) included:
Swap either single weights or all weights for a given neuron in the network. So for example, given two parents selected for reproduction either choose a particular weight in the network and swap the value (for our swaps we produced two offspring and then chose the one with the best fitness to survive in the next generation of the population), or choose a particular neuron in the network and swap all the weights for that neuron to produce two offspring.
swap an entire layer's weights. So given parents A and B, choose a particular layer (the same layer in both) and swap all the weights between them to produce two offsping. This is a large move so we set it up so that this operation would be selected less often than the others. Also, this may not make sense if your network only has a few layers.
Our mutation operators operated on a single network and would select a random weight and either:
completely replace it with a new random value
change the weight by some percentage. (multiply the weight by some random number between 0 and 2 - practically speaking we would tend to constrain that a bit and multiply it by a random number between 0.5 and 1.5. This has the effect of scaling the weight so that it doesn't change as radically. You could also do this kind of operation by scaling all the weights of a particular neuron.
add or subtract a random number between 0 and 1 to/from the weight.
Change the sign of a weight.
swap weights on a single neuron.
You can certainly get creative with mutation operators, you may discover something that works better for your particular problem.
IIRC, we would choose two parents from the population based on random proportional selection, then ran mutation operations on each of them and then ran these mutated parents through the reproduction operation and ran the two offspring through the fitness function to select the fittest one to go into the next generation population.
Of course, in your case since you're also evolving the topology some of these reproduction operations above won't make much sense because two selected parents could have completely different topologies. In NEAT (as I understand it) you can have connections between non-contiguous layers of the network, so for example you can have a layer 1 neuron feed another in layer 4, instead of feeding directly to layer 2. That makes swapping operations involving all the weights of a neuron more difficult - you could try to choose two neurons in the network that have the same number of weights, or just stick to swapping single weights in the network.
I know that while training a NE, usually the backpropagation algorithm is used to correct the weights
Actually, in NE backprop isn't used. It's the mutations performed by the GA that are training the network as an alternative to backprop. In our case backprop was problematic due to some "unorthodox" additions to the network which I won't go into. However, if backprop had been possible, I would have gone with that. The genetic approach to training NNs definitely seems to proceed much more slowly than backprop probably would have. Also, when using an evolutionary method for adjusting weights of the network, you start needing to tweak various parameters of the GA like crossover and mutation rates.
In NEAT, everything is done through the genetic operators. As you already know, the topology is evolved through crossover and mutation events.
The weights are evolved through mutation events. Like in any evolutionary algorithm, there is some probability that a weight is changed randomly (you can either generate a brand new number or you can e.g. add a normally distributed random number to the original weight).
Implementing NEAT might seem an easy task but there is a lot of small details that make it fairly complicated in the end. You might want to look at existing implementations and use one of them or at least be inspired by them. Everything important can be found at the NEAT Users Page.

Activation function for neural network

I need help in figuring out a suitable activation function. Im training my neural network to detect a piano note. So in this case I can have only one output. Either the note is there (1) or the note is not present (0).
Say I introduce a threshold value of 0.5 and say that if the output is greater than 0.5 the desired note is present and if its less than 0.5 the note isn't present, what type of activation function can I use. I assume it should be hard limit, but I'm wondering if sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentable activation functions. Thresholding is not a good choice for multilayer neural networks. Sigmoid is quite generic function, which can be applied in most of the cases. When you are doing a binary classification (0/1 values), the most common approach is to define one output neuron, and simply choose a class 1 iff its output is bigger than a threshold (typically 0.5).
EDIT
As you are working with quite simple data (two input dimensions and two output classes) it seems a best option to actually abandon neural networks and start with data visualization. 2d data can be simply plotted on the plane (with different colors for different classes). Once you do it, you can investigate how hard is it to separate one class from another. If data is located in the way, that you can simply put a line separating them - linear support vector machine would be much better choice (as it will guarantee one global optimum). If data seems really complex, and the decision boundary has to be some curve (or even set of curves) I would suggest going for RBF SVM, or at least regularized form of neural network (so its training is at least quite repeatable). If you decide on neural network - situation is quite similar - if data is simply to separate on the plane - you can use simple (linear/threshold) activation functions. If it is not linearly separable - use sigmoid or hyperbolic tangent which will ensure non linearity in the decision boundary.
UPDATE
Many things changed through last two years. In particular (as suggested in the comment, #Ulysee) there is a growing interest in functions differentable "almost everywhere" such as ReLU. These functions have valid derivative in most of its domain, so the probability that we will ever need to derivate in these point is zero. Consequently, we can still use classical methods and for sake of completness put a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as softplus function
The wikipedia article has some useful "soft" continuous threshold functions - see Figure Gjl-t(x).svg.
en.wikipedia.org/wiki/Sigmoid_function.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification, where one class label is mapped to the output node when activated, and the other class label for when the output node is not activated.

Appropriateness of an artificial neural network in pose estimation

I am working on a project for uni which requires markerless relative pose estimation. To do this I take two images and match n features in certain locations of the picture. From these points I can find vectors between these points which, when included with distance, can be used to estimate the new postition of the camera.
The project is required to be deplyoable on mobile devices so the algorithm needs to be efficient. A thought I had to make it more efficient would be to take these vectors and put them into a Neural Network which could take the vectors and output an estimation of the xyz movement vector based on the input.
The question I have is if a NN could be appropriate for this situation if sufficiently trained? and, if so, how would I calculate the number of hidden units I would need and what the best activation function would be?
Using a neural network for your application can very well work, however, I feel you will need a lot of training samples to allow the network to generalize. Of course, this also depends on the type and number of poses you're dealing with. It sounds to me that with some clever maths it might be possible to derive the movement vector directly from the input vector -- if by any chance you can come up with a way of doing that (or provide more information so others can think about it too), that would very much be preferred, as in that case you would include prior knowledge you have about the task instead of relying on the NN to learn it from data.
If you decide to go ahead with the NN approach, keep the following in mind:
Divide your data into training and validation set. This allows you to make sure that the network doesn't overfit. You train using the training set and determine the quality of a particular network using the error on the validation set. The ratio of training/validation depends on the amount of data you have. A large validation set (e.g., 50% of your data) will allow more precise conclusions about the quality of the trained network, but often you have too few data to afford this. However, in any case I would suggest to use at least 10% of your data for validation.
As to the number of hidden units, a rule of thumb is to have at least 10 training examples for each free parameter, i.e., each weight. So assuming you have a 3-layer network with 4 inputs, 10 hidden units, and 3 output units, where each hidden unit and the output units have additionally a bias weight, you would have (4+1) * 10 + (10+1) * 3 = 83 free parameters/weights. In general you should experiment with the number of hidden units and also the number of hidden layers. From my experience 4-layer networks (i.e., 2 hidden layers) work better than 3-layer network, but that depends on the problem. Since you also have the validation set, you can find out what network architecture and size works without having to fear overfitting.
For the activation function you should use some sigmoid function to allow for non-linear behavior. I like the hyperbolic tangent for its symmetry, but from my experience you can just as well use the logistic function.

non linear svm kernel dimension

I have some problems with understanding the kernels for non-linear SVM.
First what I understood by non-linear SVM is: using kernels the input is transformed to a very high dimension space where the transformed input can be separated by a linear hyper-plane.
Kernel for e.g: RBF:
K(x_i, x_j) = exp(-||x_i - x_j||^2/(2*sigma^2));
where x_i and x_j are two inputs. here we need to change the sigma to adapt to our problem.
(1) Say if my input dimension is d, what will be the dimension of the
transformed space?
(2) If the transformed space has a dimension of more than 10000 is it
effective to use a linear SVM there to separate the inputs?
Well it is not only a matter of increasing the dimension. That's the general mechanism but not the whole idea, if it were true that the only goal of the kernel mapping is to increase the dimension, one could conclude that all kernels functions are equivalent and they are not.
The way how the mapping is made would make possible a linear separation in the new space.
Talking about your example and just to extend a bit what greeness said, RBF kernel would order the feature space in terms of hyperspheres where an input vector would need to be close to an existing sphere in order to produce an activation.
So to answer directly your questions:
1) Note that you don't work on feature space directly. Instead, the optimization problem is solved using the inner product of the vectors in the feature space, so computationally you won't increase the dimension of the vectors.
2) It would depend on the nature of your data, having a high dimensional pattern would somehow help you to prevent overfitting but not necessarily will be linearly separable. Again, the linear separability in the new space would be achieved because the way the map is made and not only because it is in a higher dimension. In that sense, RBF would help but keep in mind that it might not perform well on generalization if your data is not locally enclosed.
The transformation usually increases the number of dimensions of your data, not necessarily very high. It depends. The RBF Kernel is one of the most popular kernel functions. It adds a "bump" around each data point. The corresponding feature space is a Hilbert space of infinite dimensions.
It's hard to tell if a transformation into 10000 dimensions is effective or not for classification without knowing the specific background of your data. However, choosing a good mapping (encoding prior knowledge + getting right complexity of function class) for your problem improves results.
For example, the MNIST database of handwritten digits contains 60K training examples and 10K test examples with 28x28 binary images.
Linear SVM has ~8.5% test error.
Polynomial SVM has ~ 1% test error.
Your question is a very natural one that almost everyone who's learned about kernel methods has asked some variant of. However, I wouldn't try to understand what's going on with a non-linear kernel in terms of the implied feature space in which the linear hyperplane is operating, because most non-trivial kernels have feature spaces that it is very difficult to visualise.
Instead, focus on understanding the kernel trick, and think of the kernels as introducing a particular form of non-linear decision boundary in input space. Because of the kernel trick, and some fairly daunting maths if you're not familiar with it, any kernel function satisfying certain properties can be viewed as operating in some feature space, but the mapping into that space is never performed. You can read the following (fairly) accessible tutorial if you're interested: from zero to Reproducing Kernel Hilbert Spaces in twelve pages or less.
Also note that because of the formulation in terms of slack variables, the hyperplane does not have to separate points exactly: there's an objective function that's being maximised which contains penalties for misclassifying instances, but some misclassification can be tolerated if the margin of the resulting classifier on most instances is better. Basically, we're optimising a classification rule according to some criteria of:
how big the margin is
the error on the training set
and the SVM formulation allows us to solve this efficiently. Whether one kernel or another is better is very application-dependent (for example, text classification and other language processing problems routinely show best performance with a linear kernel, probably due to the extreme dimensionality of the input data). There's no real substitute for trying a bunch out and seeing which one works best (and make sure the SVM hyperparameters are set properly---this talk by one of the LibSVM authors has the gory details).

Why do we have to normalize the input for an artificial neural network? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical a certain transformation must be performed, but when we have a numerical input? Why the numbers must be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.
In neural networks, it is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface. See the following pictures:
Pictures are taken from the coursera course about neural networks. Author of the course is Geoffrey Hinton.
Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such case feeding this raw value into your network will not work very well. You will teach your network on values from lower part of range, while the actual inputs will be from the higher part of this range (and quite possibly above range, that the network has learned to work with).
You should normalize this value. You could for example tell the network by how much the value has changed since the previous input. This increment usually can be defined with high probability in a specific range, which makes it a good input for network.
There are 2 Reasons why we have to Normalize Input Features before Feeding them to Neural Network:
Reason 1: If a Feature in the Dataset is big in scale compared to others then this big scaled feature becomes dominating and as a result of that, Predictions of the Neural Network will not be Accurate.
Example: In case of Employee Data, if we consider Age and Salary, Age will be a Two Digit Number while Salary can be 7 or 8 Digit (1 Million, etc..). In that Case, Salary will Dominate the Prediction of the Neural Network. But if we Normalize those Features, Values of both the Features will lie in the Range from (0 to 1).
Reason 2: Front Propagation of Neural Networks involves the Dot Product of Weights with Input Features. So, if the Values are very high (for Image and Non-Image Data), Calculation of Output takes a lot of Computation Time as well as Memory. Same is the case during Back Propagation. Consequently, Model Converges slowly, if the Inputs are not Normalized.
Example: If we perform Image Classification, Size of Image will be very huge, as the Value of each Pixel ranges from 0 to 255. Normalization in this case is very important.
Mentioned below are the instances where Normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect some of the parameters. That leads to large oscillations in the search space, as you are bouncing between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, where range from 0 to 1 and 0 to 1 million, respectively. It turns out the ratios for the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).
The reason normalization is needed is because if you look at how an adaptive step proceeds in one place in the domain of the function, and you just simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, then you get different results. It boils down to the question of adapting a linear piece to a data point. How much should the piece move without turning and how much should it turn in response to that one training point? It makes no sense to have a changed adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
On a high level, if you observe as to where normalization/standardization is mostly used, you will notice that, anytime there is a use of magnitude difference in model building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitude don't loose their significance midway the model building process.
example:
√(3-1)^2+(1000-900)^2 ≈ √(1000-900)^2
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses euclidean or, other distance measures.
NNs use optimization algorithm to minimise cost function(ex. - MSE).
Both distance measure(Clustering) and cost function(NNs) use magnitude difference in some way and hence standardization ensures that magnitude difference doesn't command over important input parameters and the algorithm works as expected.
Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable then we need not to use hidden layer e.g. OR gate but if we have a non linearly seperable data then we need to use hidden layer for example ExOR logical gate.
Number of nodes taken at any layer depends upon the degree of cross validation of our output.

Resources