How to evolve weights of a neural network in Neuroevolution? - machine-learning

I'm new to Artificial Neural Networks and NeuroEvolution algorithms in general. I'm trying to implement the algorithm called NEAT (NeuroEvolution of Augmented Topologies), but the description in original public paper missed the method of how to evolve the weights of a network, it says
Connection weights mutate as in any NE system, with each connection either perturbed or not at each generation
I've done some searching about how to mutate weights in NE systems, but can't find any detailed description, unfortunately.
I know that while training a neural network, usually the backpropagation algorithm is used to correct the weights, but it only works if you have a fixed topology (structure) through generations and you know the answer to the problem. In NeuroEvolution, you don't know the answer, you have only the fitness function, so it's not possible to use backpropagation here.

I have some experience with training a fixed-topology NN using a genetic algorithm (What the paper refers to as the "traditional NE approach"). There are several different mutation and reproduction operators we used for this and we selected those randomly.
Given two parents, our reproduction operators (could also call these crossover operators) included:
Swap either single weights or all weights for a given neuron in the network. So for example, given two parents selected for reproduction either choose a particular weight in the network and swap the value (for our swaps we produced two offspring and then chose the one with the best fitness to survive in the next generation of the population), or choose a particular neuron in the network and swap all the weights for that neuron to produce two offspring.
swap an entire layer's weights. So given parents A and B, choose a particular layer (the same layer in both) and swap all the weights between them to produce two offsping. This is a large move so we set it up so that this operation would be selected less often than the others. Also, this may not make sense if your network only has a few layers.
Our mutation operators operated on a single network and would select a random weight and either:
completely replace it with a new random value
change the weight by some percentage. (multiply the weight by some random number between 0 and 2 - practically speaking we would tend to constrain that a bit and multiply it by a random number between 0.5 and 1.5. This has the effect of scaling the weight so that it doesn't change as radically. You could also do this kind of operation by scaling all the weights of a particular neuron.
add or subtract a random number between 0 and 1 to/from the weight.
Change the sign of a weight.
swap weights on a single neuron.
You can certainly get creative with mutation operators, you may discover something that works better for your particular problem.
IIRC, we would choose two parents from the population based on random proportional selection, then ran mutation operations on each of them and then ran these mutated parents through the reproduction operation and ran the two offspring through the fitness function to select the fittest one to go into the next generation population.
Of course, in your case since you're also evolving the topology some of these reproduction operations above won't make much sense because two selected parents could have completely different topologies. In NEAT (as I understand it) you can have connections between non-contiguous layers of the network, so for example you can have a layer 1 neuron feed another in layer 4, instead of feeding directly to layer 2. That makes swapping operations involving all the weights of a neuron more difficult - you could try to choose two neurons in the network that have the same number of weights, or just stick to swapping single weights in the network.
I know that while training a NE, usually the backpropagation algorithm is used to correct the weights
Actually, in NE backprop isn't used. It's the mutations performed by the GA that are training the network as an alternative to backprop. In our case backprop was problematic due to some "unorthodox" additions to the network which I won't go into. However, if backprop had been possible, I would have gone with that. The genetic approach to training NNs definitely seems to proceed much more slowly than backprop probably would have. Also, when using an evolutionary method for adjusting weights of the network, you start needing to tweak various parameters of the GA like crossover and mutation rates.

In NEAT, everything is done through the genetic operators. As you already know, the topology is evolved through crossover and mutation events.
The weights are evolved through mutation events. Like in any evolutionary algorithm, there is some probability that a weight is changed randomly (you can either generate a brand new number or you can e.g. add a normally distributed random number to the original weight).
Implementing NEAT might seem an easy task but there is a lot of small details that make it fairly complicated in the end. You might want to look at existing implementations and use one of them or at least be inspired by them. Everything important can be found at the NEAT Users Page.

Related

Instance Normalisation vs Batch normalisation

I understand that Batch Normalisation helps in faster training by turning the activation towards unit Gaussian distribution and thus tackling vanishing gradients problem. Batch norm acts is applied differently at training(use mean/var from each batch) and test time (use finalized running mean/var from training phase).
Instance normalisation, on the other hand, acts as contrast normalisation as mentioned in this paper https://arxiv.org/abs/1607.08022 . The authors mention that the output stylised images should be not depend on the contrast of the input content image and hence Instance normalisation helps.
But then should we not also use instance normalisation for image classification where class label should not depend on the contrast of input image. I have not seen any paper using instance normalisation in-place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together. I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.
Definition
Let's begin with the strict definition of both:
Batch normalization
Instance normalization
As you can notice, they are doing the same thing, except for the number of input tensors that are normalized jointly. Batch version normalizes all images across the batch and spatial locations (in the CNN case, in the ordinary case it's different); instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, making each individual image distribution look Gaussian, but not jointly.
A simple analogy: during data pre-processing step, it's possible to normalize the data on per-image basis or normalize the whole data set.
Credit: the formulas are from here.
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to the FC layer, which share weights across the batch (the modern way is to use the CONV layer instead of FC, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighbor instances. As it turns out, this kind of noise may be either good and bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans at al, which name recurrent neural networks and reinforcement learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think that the same noise-sensitivity was the main issue in stylization task, which instance norm tried to fight. It would be interesting to check if weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it makes a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in this case it's preferable) or hurting it (in this case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve the performance.
Great question and already answered nicely. Just to add: I found this visualisation From Kaiming He's Group Norm paper helpful.
Source: link to article on Medium contrasting the Norms
I wanted to add more information to this question since there are some more recent works in this area. Your intuition
use instance normalisation for image classification where class label
should not depend on the contrast of input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean using instance normalization across the network will give you better result. Here are some reasons:
Color distribution still play a role. It is more likely to be a apple than an orange if it has a lot of red.
At later layers, you can no longer imagine instance normalization acts as contrast normalization. Class specific details will emerge in deeper layers and normalizing them by instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in their model. They only put instance normalization in early layers and have achieved improvement in both accuracy and ability to generalize. They have open sourced code here.
IN provide visual and appearance in-variance and BN accelerate training and preserve discriminative feature.
IN is preferred in Shallow layer(starting layer of CNN) so remove appearance variation and BN is preferred in deep layers(last CNN layer) should be reduce in order to maintain discrimination.

extrapolation with recurrent neural network

I Wrote a simple recurrent neural network (7 neurons, each one is initially connected to all the neurons) and trained it using a genetic algorithm to learn "complicated", non-linear functions like 1/(1+x^2). As the training set, I used 20 values within the range [-5,5] (I tried to use more than 20 but the results were not changed dramatically).
The network can learn this range pretty well, and when given examples of other points within this range, it can predict the value of the function. However, it can not extrapolate correctly and predicting the values of the function outside the range [-5,5]. What are the reasons for that and what can I do to improve its extrapolation abilities?
Thanks!
Neural networks are not extrapolation methods (no matter - recurrent or not), this is completely out of their capabilities. They are used to fit a function on the provided data, they are completely free to build model outside the subspace populated with training points. So in non very strict sense one should think about them as an interpolation method.
To make things clear, neural network should be capable of generalizing the function inside subspace spanned by the training samples, but not outside of it
Neural network is trained only in the sense of consistency with training samples, while extrapolation is something completely different. Simple example from "H.Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8" shows how NN behave in this context
All of these networks are consistent with training data, but can do anything outside of this subspace.
You should rather reconsider your problem's formulation, and if it can be expressed as a regression or classification problem then you can use NN, otherwise you should think about some completely different approach.
The only thing, which can be done to somehow "correct" what is happening outside the training set is to:
add artificial training points in the desired subspace (but this simply grows the training set, and again - outside of this new set, network's behavious is "random")
add strong regularization, which will force network to create very simple model, but model's complexity will not guarantee any extrapolation strength, as two model's of exactly the same complexity can have for example completely different limits in -/+ infinity.
Combining above two steps can help building model which to some extent "extrapolates", but this, as stated before, is not a purpose of a neural network.
As far as I know this is only possible with networks which do have the echo property. See Echo State Networks on scholarpedia.org.
These networks are designed for arbitrary signal learning and are capable to remember their behavior.
You can also take a look at this tutorial.
The nature of your post(s) suggests that what you're referring to as "extrapolation" would be more accurately defined as "sequence recognition and reproduction." Training networks to recognize a data sequence with or without time-series (dt) is pretty much the purpose of Recurrent Neural Network (RNN).
The training function shown in your post has output limits governed by 0 and 1 (or -1, since x is effectively abs(x) in the context of that function). So, first things first, be certain your input layer can easily distinguish between negative and positive inputs (if it must).
Next, the number of neurons is not nearly as important as how they're layered and interconnected. How many of the 7 were used for the sequence inputs? What type of network was used and how was it configured? Network feedback will reveal the ratios, proportions, relationships, etc. and aid in the adjustment of network weight adjustments to match the sequence. Feedback can also take the form of a forward-feed depending on the type of network used to create the RNN.
Producing an 'observable' network for the exponential-decay function: 1/(1+x^2), should be a decent exercise to cut your teeth on RNNs. 'Observable', meaning the network is capable of producing results for any input value(s) even though its training data is (far) smaller than all possible inputs. I can only assume that this was your actual objective as opposed to "extrapolation."

Activation function for neural network

I need help in figuring out a suitable activation function. Im training my neural network to detect a piano note. So in this case I can have only one output. Either the note is there (1) or the note is not present (0).
Say I introduce a threshold value of 0.5 and say that if the output is greater than 0.5 the desired note is present and if its less than 0.5 the note isn't present, what type of activation function can I use. I assume it should be hard limit, but I'm wondering if sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentable activation functions. Thresholding is not a good choice for multilayer neural networks. Sigmoid is quite generic function, which can be applied in most of the cases. When you are doing a binary classification (0/1 values), the most common approach is to define one output neuron, and simply choose a class 1 iff its output is bigger than a threshold (typically 0.5).
EDIT
As you are working with quite simple data (two input dimensions and two output classes) it seems a best option to actually abandon neural networks and start with data visualization. 2d data can be simply plotted on the plane (with different colors for different classes). Once you do it, you can investigate how hard is it to separate one class from another. If data is located in the way, that you can simply put a line separating them - linear support vector machine would be much better choice (as it will guarantee one global optimum). If data seems really complex, and the decision boundary has to be some curve (or even set of curves) I would suggest going for RBF SVM, or at least regularized form of neural network (so its training is at least quite repeatable). If you decide on neural network - situation is quite similar - if data is simply to separate on the plane - you can use simple (linear/threshold) activation functions. If it is not linearly separable - use sigmoid or hyperbolic tangent which will ensure non linearity in the decision boundary.
UPDATE
Many things changed through last two years. In particular (as suggested in the comment, #Ulysee) there is a growing interest in functions differentable "almost everywhere" such as ReLU. These functions have valid derivative in most of its domain, so the probability that we will ever need to derivate in these point is zero. Consequently, we can still use classical methods and for sake of completness put a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as softplus function
The wikipedia article has some useful "soft" continuous threshold functions - see Figure Gjl-t(x).svg.
en.wikipedia.org/wiki/Sigmoid_function.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification, where one class label is mapped to the output node when activated, and the other class label for when the output node is not activated.

Echo state neural network?

Is anyone here who is familiar with echo state networks? I created an echo state network in c#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that maybe for this classification echo state network isn't the best choice, but i have to do it with this method.
My problem is, that after training the network, it cannot generalize. When i run the network with foreign data (not the teaching input), i get only around 50-60% good result.
More details: My echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (i have to classify the input into bad or good input).
So i have created a network. It contains an input layer with 17 neurons, a reservoir layer, which neron number is adjustable, and output layer containing 1 neuron for the output needed 0 or 1. In a simpler example, no output feedback is used (i tried to use output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. IF the values are too big, it normlites the matrix to have a spectral radius lower then 1. The reservoir layer can have sigmoid and tanh activaton functions.
The input layer is fully connected to the reservoir layer with random values. So in the training state i run calculate the inner X(n) reservor activations with training data, collecting them into a matrix rowvise. Using the desired output data matrix (which is now a vector with 1 ot 0 values), i calculate the output weigths (from reservoir to output). Reservoir is fully connected to the output. If someone used echo state networks nows what im talking about. I ise pseudo inverse method for this.
The question is, how can i adjust the network so it would generalize better? To hit more than 50-60% of the desired outputs with a foreign dataset (not the training one). If i run the network again with the training dataset, it gives very good reults, 80-90%, but that i want is to generalize better.
I hope someone had this issue too with echo state networks.
If I understand correctly, you have a set of known, classified data that you train on, then you have some unknown data which you subsequently classify. You find that after training, you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting - you might want to think about being less stringent with your network, reducing node number, and/or training based on a hidden dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters---early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).

Why do we have to normalize the input for an artificial neural network? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical a certain transformation must be performed, but when we have a numerical input? Why the numbers must be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.
In neural networks, it is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface. See the following pictures:
Pictures are taken from the coursera course about neural networks. Author of the course is Geoffrey Hinton.
Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such case feeding this raw value into your network will not work very well. You will teach your network on values from lower part of range, while the actual inputs will be from the higher part of this range (and quite possibly above range, that the network has learned to work with).
You should normalize this value. You could for example tell the network by how much the value has changed since the previous input. This increment usually can be defined with high probability in a specific range, which makes it a good input for network.
There are 2 Reasons why we have to Normalize Input Features before Feeding them to Neural Network:
Reason 1: If a Feature in the Dataset is big in scale compared to others then this big scaled feature becomes dominating and as a result of that, Predictions of the Neural Network will not be Accurate.
Example: In case of Employee Data, if we consider Age and Salary, Age will be a Two Digit Number while Salary can be 7 or 8 Digit (1 Million, etc..). In that Case, Salary will Dominate the Prediction of the Neural Network. But if we Normalize those Features, Values of both the Features will lie in the Range from (0 to 1).
Reason 2: Front Propagation of Neural Networks involves the Dot Product of Weights with Input Features. So, if the Values are very high (for Image and Non-Image Data), Calculation of Output takes a lot of Computation Time as well as Memory. Same is the case during Back Propagation. Consequently, Model Converges slowly, if the Inputs are not Normalized.
Example: If we perform Image Classification, Size of Image will be very huge, as the Value of each Pixel ranges from 0 to 255. Normalization in this case is very important.
Mentioned below are the instances where Normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect some of the parameters. That leads to large oscillations in the search space, as you are bouncing between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, where range from 0 to 1 and 0 to 1 million, respectively. It turns out the ratios for the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).
The reason normalization is needed is because if you look at how an adaptive step proceeds in one place in the domain of the function, and you just simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, then you get different results. It boils down to the question of adapting a linear piece to a data point. How much should the piece move without turning and how much should it turn in response to that one training point? It makes no sense to have a changed adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
On a high level, if you observe as to where normalization/standardization is mostly used, you will notice that, anytime there is a use of magnitude difference in model building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitude don't loose their significance midway the model building process.
example:
√(3-1)^2+(1000-900)^2 ≈ √(1000-900)^2
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses euclidean or, other distance measures.
NNs use optimization algorithm to minimise cost function(ex. - MSE).
Both distance measure(Clustering) and cost function(NNs) use magnitude difference in some way and hence standardization ensures that magnitude difference doesn't command over important input parameters and the algorithm works as expected.
Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable then we need not to use hidden layer e.g. OR gate but if we have a non linearly seperable data then we need to use hidden layer for example ExOR logical gate.
Number of nodes taken at any layer depends upon the degree of cross validation of our output.

Resources