How to normalize neural networks for genetic algorithms? - normalization

I am trying to train a simple feedforward neural network using a genetic algorithm, however it is proving fairly inefficient because isomorphic neural networks appear different to the genetic algorithm.
It is possible to have multiple neural networks, which behave the same way, but have their neurons ordered in a different way from left to right and across levels. To the genetic algorithms those networks' genotypes will appear completely different. Therefore any attempt to do crossover is pointless and the GA ends up being as effective as hill climbing.
Can you recommend a way to normalize the networks so they appear more transparent to the genetic algorithm?

I would call crossover in this context "inefficient", rather than "pointless". One way to address the duplication you mention might be to sort the hidden layer neurons in some canonical order, and use this order during crossover, which might at least reduce the duplication encountered in hidden weight space.
Also, you might fit the output layer weights by a more direct method than genetic algorithms. You don't say what performance metric is being used, but many common measures have fairly straightforward optimizations. So, as an example, you might generate a new hidden layer using genetic operators, then fit the output layer by logistic regression, and have the GA evaluate the total network.

Related

In what circumstances might using biases in a neural network not be beneficial?

I am currently looking through Michael Nielsen's ebook Neural Networks and Deep Learning and have run the code found at the end of chapter 1 which trains a neural network to recognize hand-written digits (with a slight modification to make the backpropagation algorithm over a mini-batch matrix-based).
However, having run this code and achieving a classification accuracy of just under 94%, I decided to remove the use of biases from the network. After re-training the modified network, I found no difference in classification accuracy!
NB: The output layer of this network contains ten neurons; if the ith of these neurons has the highest activation then the input is classified as being the digit i.
This got me wondering why it is necessary to use biases in a neural network, rather than just weights, and what differentiates between a task where biases will improve the performance of a network and a task where they will not?
My code can be found here: https://github.com/pipthagoras/neural-network-1
Biases are used to account for the fact that your underlying data might not be centered. It is clearer to see in the case of a linear regression.
If you do a regression without an intercept (or bias), you are forcing the underlying model to pass through the origin, which will result in a poor model if the underlying data is not centered (for example if the true generating process is Y=3000). If, on the other hand, your data is centered or close to centered, then eliminating bias is good, since you won't introduce a term that is, in fact, independent to your predictive variable (it's like selecting a simpler model, which will tend to generalize better PROVIDED that it actually reflects the underlying data).

Why do neural networks work so well?

I understand all the computational steps of training a neural network with gradient descent using forwardprop and backprop, but I'm trying to wrap my head around why they work so much better than logistic regression.
For now all I can think of is:
A) the neural network can learn it's own parameters
B) there are many more weights than simple logistic regression thus allowing for more complex hypotheses
Can someone explain why a neural network works so well in general? I am a relative beginner.
Neural Networks can have a large number of free parameters (the weights and biases between interconnected units) and this gives them the flexibility to fit highly complex data (when trained correctly) that other models are too simple to fit. This model complexity brings with it the problems of training such a complex network and ensuring the resultant model generalises to the examples it’s trained on (typically neural networks require large volumes of training data, that other models don't).
Classically logistic regression has been limited to binary classification using a linear classifier (although multi-class classification can easily be achieved with one-vs-all, one-vs-one approaches etc. and there are kernalised variants of logistic regression that allow for non-linear classification tasks). In general therefore, logistic regression is typically applied to more simple, linearly-separable classification tasks, where small amounts of training data are available.
Models such as logistic regression and linear regression can be thought of as simple multi-layer perceptrons (check out this site for one explanation of how).
To conclude, it’s the model complexity that allows neural nets to solve more complex classification tasks, and to have a broader application (particularly when applied to raw data such as image pixel intensities etc.), but their complexity means that large volumes of training data are required and training them can be a difficult task.
Recently Dr. Naftali Tishby's idea of Information Bottleneck to explain the effectiveness of deep neural networks is making the rounds in the academic circles.
His video explaining the idea (link below) can be rather dense so I'll try to give the distilled/general form of the core idea to help build intuition
https://www.youtube.com/watch?v=XL07WEc2TRI
To ground your thinking, vizualize the MNIST task of classifying the digit in the image. For this, I am only talking about simple fully-connected neural networks (not Convolutional NN as is typically used for MNIST)
The input to a NN contains information about the output hidden inside of it. Some function is needed to transform the input to the output form. Pretty obvious.
The key difference in thinking needed to build better intuition is to think of the input as a signal with "information" in it (I won't go into information theory here). Some of this information is relevant for the task at hand (predicting the output). Think of the output as also a signal with a certain amount of "information". The neural network tries to "successively refine" and compress the input signal's information to match the desired output signal. Think of each layer as cutting away at the unneccessary parts of the input information, and
keeping and/or transforming the output information along the way through the network.
The fully-connected neural network will transform the input information into a form in the final hidden layer, such that it is linearly separable by the output layer.
This is a very high-level and fundamental interpretation of the NN, and I hope it will help you see it clearer. If there are parts you'd like me to clarify, let me know.
There are other essential pieces in Dr.Tishby's work, such as how minibatch noise helps training, and how the weights of a neural network layer can be seen as doing a random walk within the constraints of the problem.
These parts are a little more detailed, and I'd recommend first toying with neural networks and taking a course on Information Theory to help build your understanding.
Consider you have a large dataset and you want to build a binary classification model for that, Now you have two options that you have pointed out
Logistic Regression
Neural Networks ( Consider FFN for now )
Each node in a neural network will be associated with an activation function for example let's choose Sigmoid since Logistic regression also uses sigmoid internally to make decision.
Let's see how the decision of logistic regression looks when applied on the data
See some of the green spots present in the red boundary?
Now let's see the decision boundary of neural network (Forgive me for using a different color)
Why this happens? Why does the decision boundary of neural network is so flexible which gives more accurate results than Logistic regression?
or the question you asked is "Why neural networks works so well ?" is because of it's hidden units or hidden layers and their representation power.
Let me put it this way.
You have a logistic regression model and a Neural network which has say 100 neurons each of Sigmoid activation. Now each neuron will be equivalent to one logistic regression.
Now assume a hundred logistic units trained together to solve one problem versus one logistic regression model. Because of these hidden layers the decision boundary expands and yields better results.
While you are experimenting you can add more number of neurons and see how the decision boundary is changing. A logistic regression is same as a neural network with single neuron.
The above given is just an example. Neural networks can be trained to get very complex decision boundaries
Neural networks allow the person training them to algorithmically discover features, as you pointed out. However, they also allow for very general nonlinearity. If you wish, you can use polynomial terms in logistic regression to achieve some degree of nonlinearity, however, you must decide which terms you will use. That is you must decide a priori which model will work. Neural networks can discover the nonlinear model that is needed.
'Work so well' depends on the concrete scenario. Both of them do essentially the same thing: predicting.
The main difference here is neural network can have hidden nodes for concepts, if it's propperly set up (not easy), using these inputs to make the final decission.
Whereas linear regression is based on more obvious facts, and not side effects. A neural network should de able to make more accurate predictions than linear regression.
Neural networks excel at a variety of tasks, but to get an understanding of exactly why, it may be easier to take a particular task like classification and dive deeper.
In simple terms, machine learning techniques learn a function to predict which class a particular input belongs to, depending on past examples. What sets neural nets apart is their ability to construct these functions that can explain even complex patterns in the data. The heart of a neural network is an activation function like Relu, which allows it to draw some basic classification boundaries like:
Example classification boundaries of Relus
By composing hundreds of such Relus together, neural networks can create arbitrarily complex classification boundaries, for example:
Composing classification boundaries
The following article tries to explain the intuition behind how neural networks work: https://medium.com/machine-intelligence-report/how-do-neural-networks-work-57d1ab5337ce
Before you step into neural network see if you have assessed all aspects of normal regression.
Use this as a guide
and even before you discard normal regression - for curved type of dependencies - you should strongly consider kernels with SVM
Neural networks are defined with an objective and loss function. The only process that happens within a neural net is to optimize for the objective function by reducing the loss function or error. The back propagation helps in finding the optimized objective function and reach our output with an output condition.

Is Back propagation Algorithm independent algorithm

Isn't the back propagation algorithm independent algorithm or do we need any other algorithms such as Bayesian along with it for neural network learning?And do we need any probabilistic approach for implementing back propagation algorithm?
Back propagation is just an efficient way of computing gradients in computational graphs. That's all. You do not have to use it (although computing gradients without it is extremely expensive), and what you do with it is up to you - there are hundreads of ways to use gradients. The most common one is to use it in order to run first order optimization techniques (such as SGD, RMSProp or Adam). Thus to address your question - backpropagation is enough if and only if your task is to compute a gradient. For learning neural net you need at least one more piece - an actual learning algorithm (such as SGD, which is literally a single line of code). It is hard to say how "independet" it is from other methods, as as I said - gradients can be used everywhere.

Can RNN (Recurrent Neural Network) be trained as normal neural networks?

What is the difference between training a RNN and a simple neural networks? Can RNN be trained using feed forward and backward method?
Thanks ahead!
The difference is recurrence. Thus RNN cannot be easily trained as if you try to compute gradient - you will soon figure out that in order to get a gradient on n'th step - you need to actually "unroll" your network history for n-1 previous steps. This technique, known as BPTT (backpropagation through time) is exactly this - direct application of backpropagation to RNN. Unfortunately this is both computationaly expensive as well as mathematically challenging (due to vanishing/exploding gradients). People are creating workaround on many levels, by for example introduction of specific types of RNN which can be efficiently trained (LSTM, GRU), or by modification of training procedure (such as gradient clamping). To sum up - theoreticaly you can do "typical" backprop in the mathematical sense, from programming perspective - this requires more work as you need to "unroll" your network through history. This is computationaly expensive, and hard to optimize in the mathematical sense.

Machine Learning: Unsupervised Backpropagation

I'm having trouble with some of the concepts in machine learning through neural networks. One of them is backpropagation. In the weight updating equation,
delta_w = a*(t - y)*g'(h)*x
t is the "target output", which would be your class label, or something, in the case of supervised learning. But what would the "target output" be for unsupervised learning?
Can someone kindly provide an example of how you'd use BP in unsupervised learning, specifically for clustering of classification?
Thanks in advance.
The most common thing to do is train an autoencoder, where the desired outputs are equal to the inputs. This makes the network try to learn a representation that best "compresses" the input distribution.
Here's a patent describing a different approach, where the output labels are assigned randomly and then sometimes flipped based on convergence rates. It seems weird to me, but okay.
I'm not familiar with other methods that use backpropogation for clustering or other unsupervised tasks. Clustering approaches with ANNs seem to use other algorithms (example 1, example 2).
I'm not sure which unsupervised machine learning algorithm uses backpropagation specifically; if there is one I haven't heard of it. Can you point to an example?
Backpropagation is used to compute the derivatives of the error function for training an artificial neural network with respect to the weights in the network. It's named as such because the "errors" are "propagating" through the network "backwards". You need it in this case because the final error with respect to the target depends on a function of functions (of functions ... depending on how many layers in your ANN.) The derivatives allow you to then adjust the values to improve the error function, tempered by the learning rate (this is gradient descent).
In unsupervised algorithms, you don't need to do this. For example, in k-Means, where you are trying to minimize the mean squared error (MSE), you can minimize the error directly at each step given the assignments; no gradients needed. In other clustering models, such as a mixture of Gaussians, the expectation-maximization (EM) algorithm is much more powerful and accurate than any gradient-descent based method.
What you might be asking is about unsupervised feature learning and deep learning.
Feature learning is the only unsupervised method I can think of with respect of NN or its recent variant.(a variant called mixture of RBM's is there analogous to mixture of gaussians but you can build a lot of models based on the two). But basically Two models I am familiar with are RBM's(restricted boltzman machines) and Autoencoders.
Autoencoders(optionally sparse activations can be encoded in optimization function) are just feedforward neural networks which tune its weights in such a way that the output is a reconstructed input. Multiple hidden layers can be used but the weight initialization uses a greedy layer wise training for better starting point. So to answer the question the target function will be input itself.
RBM's are stochastic networks usually interpreted as graphical model which has restrictions on connections. In this setting there is no output layer and the connection between input and latent layer is bidirectional like an undirected graphical model. What it tries to learn is a distribution on inputs(observed and unobserved variables). Here also your answer would be input is the target.
Mixture of RBM's(analogous to mixture of gaussians) can be used for soft clustering or KRBM(analogous to K-means) can be used for hard clustering. Which in effect feels like learning multiple non-linear subspaces.
http://deeplearning.net/tutorial/rbm.html
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
An alternative approach is to use something like generative backpropagation. In this scenario, you train a neural network updating the weights AND the input values. The given values are used as the output values since you can compute an error value directly. This approach has been used in dimensionality reduction, matrix completion (missing value imputation) among other applications. For more information, see non-linear principal component analysis (NLPCA) and unsupervised backpropagation (UBP) which uses the idea of generative backpropagation. UBP extends NLPCA by introducing a pre-training stage. An implementation of UBP and NLPCA and unsupervised backpropagation can be found in the waffles machine learning toolkit. The documentation for UBP and NLPCA can be found using the nlpca command.
To use back-propagation for unsupervised learning it is merely necessary to set t, the target output, at each stage of the algorithm to the class for which the average distance to each element of the class before updating is least. In short we always try to train the ANN to place its input into the class whose members are most similar in terms of our input. Because this process is sensitive to input scale it is necessary to first normalize the input data in each dimension by subtracting the average and dividing by the standard deviation for each component in order to calculate the distance in a scale-invariant manner.
The advantage to using a back-prop neural network rather than a simple distance from a center definition of the clusters is that neural networks can allow for more complex and irregular boundaries between clusters.

Resources