My goal is to solve the XOR problem using a Neural Network. I’ve read countless articles on the theory, proof, and mathematics behind a multi-layered neural network. The theory makes sense (math… not so much) but I have a few simple questions regarding the evaluation and topology of a Neural Network.
I feel I am very close to solving this problem, but I am beginning to question my topology and evaluation techniques. The complexities of back propagation aside, I just want to know if my approach to evaluation is correct. With that in mind, here are my questions:
Assuming we have multiple inputs, does each respective input get its own node? Do we ever input both values into a single node? Does the order in which we enter this information matter?
While evaluating the graph output, does each node fire as soon as it gets a value? Or do we instead collect all the values from the above layer and then fire off once we’ve consumed all the input?
Does the order of evaluation matter? For example, if a given node in layer “b” is ready to fire – but other nodes in that same layer are still awaiting input – should the ready node fire anyway? Or should all nodes in the layer be loaded up before firing?
Should each layer be connected to all nodes in the following layer?
I’ve attached a picture which should help explain (some of) my questions.
Thank you for your time!
1) Yes, each input gets its own node, and that node is always the node for that input type. The order doesn't matter; you just need to keep it consistent between training and evaluation. The weights on every connection are learned from scratch, so there can't be a particular order that the input nodes have to be in for the net to work.
2 and 3) You need to collect all the values from a single layer before any node in the next layer fires. This is important if you're using any activation function other than a stepwise one, because the sum of the inputs will affect the value that is propagated forward. Thus, you need to know what that sum is before you propagate anything.
4) Which nodes to connect to which other nodes is up to you. Since your net won't be excessively large and XOR is a fairly straightforward problem, it will probably be simplest for you to connect all nodes in one layer to all nodes in the next layer (i.e. a fully-connected neural net). There might be specialized cases in other problems where it would be better to not use this topology, but there isn't an easy way to figure it out (most people either use trial and error or a genetic algorithm, as in NEAT), and you don't need to worry about it for the purposes of this problem.
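To make points 1 to 4 concrete, here is a minimal sketch of evaluating a fully-connected 2-2-1 network on XOR, one layer at a time. The weights are hand-picked for illustration (they are an assumption, not something a training run produced), and the names `forward_layer` and `xor_net` are made up for this example.

```python
def step(x):
    """Threshold activation: fires (1.0) when the weighted sum is positive."""
    return 1.0 if x > 0 else 0.0

def forward_layer(inputs, weights, biases, activation):
    """Evaluate one fully-connected layer.

    Each node sums ALL values from the previous layer (plus its bias)
    before it fires, so a layer is only evaluated once the previous
    layer is complete.
    """
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(total))
    return outputs

# Hand-picked (not trained) weights, chosen only for illustration.
HIDDEN_W = [[1.0, 1.0],   # hidden node 0 behaves like OR
            [1.0, 1.0]]   # hidden node 1 behaves like AND
HIDDEN_B = [-0.5, -1.5]
OUTPUT_W = [[1.0, -1.0]]  # output fires for "OR and not AND"
OUTPUT_B = [-0.5]

def xor_net(a, b):
    # One input node per input value; the order only has to stay consistent.
    hidden = forward_layer([a, b], HIDDEN_W, HIDDEN_B, step)
    return forward_layer(hidden, OUTPUT_W, OUTPUT_B, step)[0]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
```

A trained network would normally use a differentiable activation such as a sigmoid instead of `step`, but the evaluation order is the same.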
I am new to deep learning and am trying to understand the concept behind hidden layers, but I am not clear on the following:
Suppose there are 3 hidden layers. When the outputs of all the nodes in the 2nd layer are fed as input to all the nodes in the 3rd layer, how can the outputs of the 3rd-layer nodes differ from one another, given that they all receive the same input and the same parameter initialization? (From what I have read, I assume that all the nodes in one layer get the same random weights.)
Please correct me if I am thinking in the wrong direction.
The simple answer is because of random initialization.
If you started with the same weights throughout the neural network (NN), then all nodes in a layer would produce the same output.
This is because, when using the backprop algorithm, the error is spread out based on the activation strength of each node. If the nodes start out the same, then the error is spread equally and hence the nodes in the NN will not be able to learn different features.
So basic random initialization makes sure that each node specializes. Hence after learning, the nodes in the hidden layers will produce different outputs even when the input is the same.
Hope this helps.
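The symmetry argument above can be checked numerically. The following is a small sketch, assuming a bias-free 2-2-1 sigmoid network and a squared-error loss (both are choices made for this illustration, as is the helper name `hidden_gradients`). With identical starting weights, both hidden nodes produce the same activation and receive the same gradient; with random weights they do not.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_gradients(hidden_w, output_w, x, target):
    """Return (hidden activations, per-hidden-node weight gradients).

    Gradients are those of 0.5 * (y - target)**2 with respect to each
    hidden node's incoming weights, for a bias-free 2-layer sigmoid net.
    """
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_w]
    y = sigmoid(sum(w * hi for w, hi in zip(output_w, h)))
    delta_out = (y - target) * y * (1.0 - y)
    grads = []
    for j, hj in enumerate(h):
        # Error flowing back into hidden node j.
        delta_h = delta_out * output_w[j] * hj * (1.0 - hj)
        grads.append([delta_h * xi for xi in x])
    return h, grads

x, target = [1.0, 0.0], 1.0

# Every weight identical: the two hidden nodes are indistinguishable.
same_h, same_grads = hidden_gradients(
    [[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, target)

# Random initialization breaks the symmetry.
rng = random.Random(0)
rand_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
rand_output = [rng.uniform(-1, 1) for _ in range(2)]
rand_h, rand_grads = hidden_gradients(rand_hidden, rand_output, x, target)

print("identical init:", same_h, same_grads)
print("random init:   ", rand_h, rand_grads)
```

Because the identical-init nodes get identical updates, they stay identical after every training step, which is why they can never specialize.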
I don’t understand how the NEAT algorithm takes inputs and then outputs numbers based on the connection genes. I am familiar with using matrices in fixed-topology neural networks to feed inputs forward. However, each node in NEAT has its own number of connections and isn’t necessarily connected to every other node, and after much searching I can’t find an explanation of how NEAT produces outputs from the inputs.
Could someone explain how it works?
That was also a question I struggled with while implementing my own version of the algorithm.
You can find the answer in the NEAT Users Page: https://www.cs.ucf.edu/~kstanley/neat.html where the author says:
How are networks with arbitrary topologies activated?
The activation function, bool Network::activate(), gives the specifics. The
implementation is of course considerably different than for a simple layered
feedforward network. Each node adds up the activation from all incoming nodes
from the previous timestep. (The function also handles a special "time delayed"
connection, but that is not used by the current version of NEAT in any
experiments that we have published.) Another way to understand it is to realize
that activation does not travel all the way from the input layer to the output
layer in a single timestep. In a single timestep, activation only travels from
one neuron to the next. So it takes several timesteps for activation to get from
the inputs to the outputs. If you think about it, this is the way it works in a
real brain, where it takes time for a signal hitting your eyes to get to the
cortex because it travels over several neural connections.
So, if one of the evolved networks is not feedforward, the outputs of the network will change in different timesteps and this is particularly useful in continuous control problems, where the environment is not static, but also problematic in classification problems. The author also answers:
How do I ensure that a network stabilizes before taking its output(s) for a
classification problem?
The cheap and dirty way to do this is just to activate n times in a row where
n>1, and hope there are not too many loops or long pathways of hidden nodes.
The proper (and quite nice) way to do it is to check every hidden node and output
node from one timestep to the next, and see if nothing has changed, or at least
not changed within some delta. Once this criterion is met, the output must be
stable.
Note that output may not always stabilize in some cases. Also, for continuous
control problems, do not check for stabilization as the network never "settles"
but rather continuously reacts to a changing environment. Generally,
stabilization is used in classification problems, or in board games.
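The two quoted answers can be sketched together as follows. This is not the `Network::activate()` code from the NEAT distribution; it is a minimal Python illustration that assumes a sigmoid squashing function, a connection list of `(source, target, weight)` tuples, and made-up names (`activate_once`, `activate_until_stable`, `delta`, `max_steps`).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activate_once(state, connections, input_values):
    """One timestep: every non-input node sums the activations that its
    incoming nodes had at the PREVIOUS timestep, then squashes the sum."""
    sums = {}
    for src, dst, weight in connections:
        sums[dst] = sums.get(dst, 0.0) + weight * state[src]
    new_state = dict(state)
    new_state.update(input_values)      # inputs stay clamped
    for node, total in sums.items():
        if node not in input_values:
            new_state[node] = sigmoid(total)
    return new_state

def activate_until_stable(nodes, connections, input_values,
                          delta=1e-6, max_steps=100):
    """Repeat timesteps until no node changes by more than `delta`.

    Returns (state, steps_taken); steps_taken is None if the network
    did not settle within `max_steps` (possible with recurrent loops).
    """
    state = {n: 0.0 for n in nodes}
    state.update(input_values)
    for step in range(1, max_steps + 1):
        new_state = activate_once(state, connections, input_values)
        change = max(abs(new_state[n] - state[n]) for n in nodes)
        state = new_state
        if change < delta:
            return state, step
    return state, None

# A chain in -> h -> out: activation needs one timestep per connection.
nodes = ["in", "h", "out"]
conns = [("in", "h", 1.0), ("h", "out", 1.0)]
state, steps = activate_until_stable(nodes, conns, {"in": 1.0})
print(steps, state["out"])
```

In this chain the output is correct after two timesteps, and a third timestep is needed to observe that nothing changed.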
When I was dealing with this, I looked into loop detection using matrix methods, etc.
https://en.wikipedia.org/wiki/Adjacency_matrix#Matrix_powers
But I found the best way to feed inputs forward and get outputs was loop detection using a timeout propagation delay at each node:
A feedforward implementation is simple, and I started from there:
Wait until all incoming connections to a node have a signal, then sum-squash activate and send the result to all outgoing connections of that node. Start from the input nodes, which already have a signal from the input vector. Once there are no more nodes to be processed, manually 'shunt' the output nodes with a sum-squash operation to get the output vector.
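A rough sketch of that feedforward pass, assuming an acyclic network given as `(source, target, weight)` tuples and a sigmoid as the "squash" (the function name `feed_forward` and the data layout are illustrative assumptions). Unlike the description above, output nodes are treated like any other node here rather than being shunted in a separate final step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(connections, input_values, output_nodes):
    """Evaluate an acyclic network with arbitrary topology.

    A node fires only once every one of its incoming connections
    carries a signal.
    """
    incoming = {}
    for src, dst, weight in connections:
        incoming.setdefault(dst, []).append((src, weight))

    signal = dict(input_values)          # input nodes already have a signal
    pending = set(incoming) - set(signal)
    while pending:
        ready = [n for n in pending
                 if all(src in signal for src, _ in incoming[n])]
        if not ready:
            # Happens with cycles or with nodes that can never receive input.
            raise ValueError("no node is ready to fire")
        for node in ready:
            total = sum(signal[src] * w for src, w in incoming[node])
            signal[node] = sigmoid(total)   # sum-squash
        pending -= set(ready)
    return [signal[n] for n in output_nodes]

# "a" feeds the output both directly and through hidden node "h".
conns = [("a", "h", 1.0), ("b", "h", 1.0), ("a", "out", 0.5), ("h", "out", -2.0)]
out = feed_forward(conns, {"a": 1.0, "b": 0.0}, ["out"])
print(out)
```

Raising an error when no node is ready is the point where a timeout scheme for recurrent connections would take over.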
For circularity (the traditional NEAT implementation) I did the same as feedforward, with one more feature:
Calculate the 'maximum possible loop size' of the network. An easy way to calculate this is ~2*(total number of nodes). No walk from an input to any node in the network is longer than this without cycling, therefore a node MUST propagate within this many time steps unless it is part of a cycle.
Then I wait until all input connection signals arrive at a node OR a timeout occurs (a signal has not arrived at a connection within maximum-loop-size steps). If a timeout occurs, label the input connections that don't have signals as recurrent.
Once a connection is labelled recurrent, restart all timers on all nodes (to prevent a node later in the detected cycle from being labelled recurrent due to propagation latency).
Now forward propagation is the same as in a feedforward network, except: don't wait for connections that are recurrent; sum-squash as soon as all non-recurrent connections have arrived (using 0 for recurrent connections that don't have a signal yet). This ensures that the first node reached in a cycle is set to recurrent, making it deterministic for any given topology, and recurrent connections pass data to the next propagation time step.
This has some first-time overhead but is concise and produces the same results with a given topology each time it's run. Note that this only works when all nodes have a path to an output, so you can't necessarily disable split connections (connections that were made from node addition operations) and prune randomly during evolution without taking that into account.
(P.S. This also creates a traditional residual-recurrent network that in theory could be implemented trivially as matrix operations. If I had large networks, I would first 'express' the network by running forward propagation once to find the recurrent connections, then create a 'tensor per layer' representation for matrix-multiplication operations using the recurrent, weight, and signal connection attributes, with the recurrent attribute as a sparse binary mask. I actually started writing a Tensorflow implementation that performed all mutation/augmentation operations with tf.sparse_matrix operations and didn't use any tree objects, but I had to use dense operations, and the n^2 space consumed is too much for what I need. It did allow the use of the aforementioned adjacency matrix powers trick, since everything is in matrix form! At least one other person on Github has done tf NEAT, but I'm unsure of their implementation. I also found this interesting: https://neat-python.readthedocs.io/en/latest/neat_overview.html)
Happy Hacking!
I know how the algorithm works, but I'm not sure how it determines the clusters. Based on images I guess that it sees all the neurons that are connected by edges as one cluster. So that you might have two clusters of two groups of neurons each all connected. But is that really it?
I also wonder.. is GNG really a neural network? It doesn't have a propagation function or an activation function or weighted edges.. isn't it just a graph? I guess that depends on personal opinion a bit but I would like to hear them.
UPDATE:
This thesis www.booru.net/download/MasterThesisProj.pdf deals with GNG-clustering and on page 11 you can see an example of what looks like clusters of connected neurons. But then I'm also confused by the number of iterations. Let's say I have 500 data points to cluster. Once I put them all in, do I remove them and add them again to adapt the existing network? And how often do I do that?
I mean.. I have to re-add them at some point.. when adding a new neuron r, between two old neurons u and v then some data points formerly belonging to u should now belong to r because it's closer. But the algorithm does not contain changing the assignment of these data points. And even if I remove them after one iteration and add them all again, then the false assignment of the points for the rest of that first iteration changes the processing of the network doesn't it?
NG and GNG are a form of self-organizing maps (SOM), which are also referred to as "Kohonen neural networks".
These are based on an older, much wider view of neural networks, from when they were still inspired by nature rather than being driven by GPU capabilities for matrix operations. Back then, when you did not yet have massive-SIMD architectures, there was nothing bad about having neurons self-organize rather than being preorganized in strict layers.
I would not call them clustering, although that term is commonly (ab)used in related work, because I don't see any strong property of these "clusters".
SOMs are literally maps as in geography. A SOM is a set of nodes ("neurons") usually arranged in a 2d rectangular or hexagonal grid (= the map). The positions in the input space are then optimized iteratively to fit the data. Because they influence their neighbors, they cannot move freely. Think of wrapping a net around a tree; the knots of the net are your neurons. NG and GNG appear to be pretty much the same thing, but with a more flexible structure of nodes. But actually a nice property of SOMs is the 2d map that you can get.
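As an illustration of "positions are optimized iteratively, and nodes influence their neighbors", here is a minimal SOM training sketch on a rectangular grid. It assumes a Gaussian neighborhood and keeps the learning rate and radius fixed for brevity (practical implementations usually shrink both over time); the function names and default values are illustrative choices.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights,
               key=lambda pos: sum((w - xi) ** 2
                                   for w, xi in zip(weights[pos], x)))

def train_som(data, rows, cols, epochs=20, lr=0.5, radius=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector (a position in input space) per grid node.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for _ in range(epochs):
        for x in data:
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                # Distance measured ON THE GRID, not in input space:
                # this is what ties neighbors together like knots in a net.
                grid_dist2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights

data = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]]
som = train_som(data, rows=2, cols=2)
for pos in sorted(som):
    print(pos, som[pos])
```

Note that the data points are presented again in every epoch; nothing is stored permanently with a node, so point-to-node assignments are simply recomputed each time.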
The only approach I remember for clustering was to project the input data to the discrete 2d space of the SOM grid, then run k-means on this projection. It will probably work okayish (as in: it will perform similar to k-means), but I'm not convinced that it's theoretically well supported.
I want to implement a perceptron network and I have a small problem. The first implementation will be very simple: just three layers (input, one hidden, and the output layer). My question is: how many synapses are optimal for a hidden node between the input and hidden layers? I suspect it is not very economical for every hidden node to connect to every input node.
Thanks for the comments.
In the most general setting you connect every single node of a given layer to every node in the next one. This is called a "fully connected layer". Obviously this is not the only option, and with more advanced approaches you will find much sparser connectomes, such as receptive fields, convolutional layers, etc. For simple experiments, starting with fully connected layers is preferable, as other connection strategies usually assume something about your data (like spatio-temporal relations between inputs), while a fully connected layer is the agnostic, generic approach.
I'm new to Artificial Neural Networks and NeuroEvolution algorithms in general. I'm trying to implement the algorithm called NEAT (NeuroEvolution of Augmenting Topologies), but the description in the original published paper leaves out how to evolve the weights of a network; it says
Connection weights mutate as in any NE system, with each connection either perturbed or not at each generation
I've done some searching about how to mutate weights in NE systems, but can't find any detailed description, unfortunately.
I know that while training a neural network, usually the backpropagation algorithm is used to correct the weights, but it only works if you have a fixed topology (structure) through generations and you know the answer to the problem. In NeuroEvolution, you don't know the answer, you have only the fitness function, so it's not possible to use backpropagation here.
I have some experience with training a fixed-topology NN using a genetic algorithm (What the paper refers to as the "traditional NE approach"). There are several different mutation and reproduction operators we used for this and we selected those randomly.
Given two parents, our reproduction operators (could also call these crossover operators) included:
Swap either single weights or all weights for a given neuron in the network. So for example, given two parents selected for reproduction either choose a particular weight in the network and swap the value (for our swaps we produced two offspring and then chose the one with the best fitness to survive in the next generation of the population), or choose a particular neuron in the network and swap all the weights for that neuron to produce two offspring.
Swap an entire layer's weights. So given parents A and B, choose a particular layer (the same layer in both) and swap all the weights between them to produce two offspring. This is a large move, so we set it up so that this operation would be selected less often than the others. Also, this may not make sense if your network only has a few layers.
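The swap operators above could look roughly like this. It is a sketch under assumptions: both parents share the same fixed topology, a network is stored as nested lists (layers, then neurons, then incoming weights), and the function names are invented for this example.

```python
import copy
import random

# A network is a list of layers; a layer is a list of neurons;
# a neuron is the list of its incoming weights.

def swap_single_weight(parent_a, parent_b, rng):
    """Swap one weight at the same position in both parents."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    layer = rng.randrange(len(a))
    neuron = rng.randrange(len(a[layer]))
    idx = rng.randrange(len(a[layer][neuron]))
    a[layer][neuron][idx], b[layer][neuron][idx] = (
        b[layer][neuron][idx], a[layer][neuron][idx])
    return a, b

def swap_neuron(parent_a, parent_b, rng):
    """Swap all weights of one neuron."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    layer = rng.randrange(len(a))
    neuron = rng.randrange(len(a[layer]))
    a[layer][neuron], b[layer][neuron] = b[layer][neuron], a[layer][neuron]
    return a, b

def swap_layer(parent_a, parent_b, rng):
    """Swap all weights of one whole layer (a large move)."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    layer = rng.randrange(len(a))
    a[layer], b[layer] = b[layer], a[layer]
    return a, b

def best_offspring(parent_a, parent_b, operator, fitness, rng):
    """Produce two offspring and keep the fitter one."""
    return max(operator(parent_a, parent_b, rng), key=fitness)

rng = random.Random(1)
pa = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
pb = [[[-1.0, -2.0], [-3.0, -4.0]], [[-5.0, -6.0]]]
print(swap_layer(pa, pb, rng))
```

To make the layer swap rarer than the others, the operator could be drawn with `rng.choices` and unequal weights.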
Our mutation operators operated on a single network and would select a random weight and either:
completely replace it with a new random value
change the weight by some percentage (multiply the weight by some random number between 0 and 2; practically speaking, we would tend to constrain that a bit and multiply it by a random number between 0.5 and 1.5). This has the effect of scaling the weight so that it doesn't change as radically. You could also do this kind of operation by scaling all the weights of a particular neuron.
add or subtract a random number between 0 and 1 to/from the weight.
Change the sign of a weight.
swap weights on a single neuron.
You can certainly get creative with mutation operators; you may discover something that works better for your particular problem.
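The mutation operators listed above can be sketched as small functions that are picked at random. The flat weight list, the [-1, 1] range for fresh values, and all function names are assumptions made for this illustration.

```python
import random

def replace_weight(w, rng):
    """Completely replace the weight (the [-1, 1] range is an assumption)."""
    return rng.uniform(-1.0, 1.0)

def scale_weight(w, rng):
    """Scale by a random factor between 0.5 and 1.5."""
    return w * rng.uniform(0.5, 1.5)

def shift_weight(w, rng):
    """Add or subtract a random number between 0 and 1."""
    return w + rng.uniform(-1.0, 1.0)

def flip_sign(w, rng):
    """Change the sign of the weight."""
    return -w

MUTATIONS = [replace_weight, scale_weight, shift_weight, flip_sign]

def mutate(weights, rng):
    """Apply one randomly chosen operator to one randomly chosen weight."""
    mutated = list(weights)
    i = rng.randrange(len(mutated))
    operator = rng.choice(MUTATIONS)
    mutated[i] = operator(mutated[i], rng)
    return mutated

def swap_two_weights(neuron_weights, rng):
    """Swap two weights belonging to the same neuron."""
    swapped = list(neuron_weights)
    i, j = rng.sample(range(len(swapped)), 2)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

rng = random.Random(42)
original = [0.2, -0.7, 1.5, 0.9]
print(mutate(original, rng))
print(swap_two_weights(original, rng))
```

Each function returns a new list, so the parent stays available if the mutated copy turns out to be less fit.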
IIRC, we would choose two parents from the population based on random proportional selection, run mutation operations on each of them, run these mutated parents through the reproduction operation, and then run the two offspring through the fitness function to select the fittest one to go into the next generation's population.
Of course, in your case since you're also evolving the topology some of these reproduction operations above won't make much sense because two selected parents could have completely different topologies. In NEAT (as I understand it) you can have connections between non-contiguous layers of the network, so for example you can have a layer 1 neuron feed another in layer 4, instead of feeding directly to layer 2. That makes swapping operations involving all the weights of a neuron more difficult - you could try to choose two neurons in the network that have the same number of weights, or just stick to swapping single weights in the network.
I know that while training a NE, usually the backpropagation algorithm is used to correct the weights
Actually, in NE backprop isn't used. It's the mutations performed by the GA that are training the network as an alternative to backprop. In our case backprop was problematic due to some "unorthodox" additions to the network which I won't go into. However, if backprop had been possible, I would have gone with that. The genetic approach to training NNs definitely seems to proceed much more slowly than backprop probably would have. Also, when using an evolutionary method for adjusting weights of the network, you start needing to tweak various parameters of the GA like crossover and mutation rates.
In NEAT, everything is done through the genetic operators. As you already know, the topology is evolved through crossover and mutation events.
The weights are evolved through mutation events. Like in any evolutionary algorithm, there is some probability that a weight is changed randomly (you can either generate a brand new number or you can e.g. add a normally distributed random number to the original weight).
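A minimal sketch of such a weight mutation follows. The probabilities, the standard deviation, and the weight range are placeholder values chosen for this example, not constants taken from the NEAT paper or from any particular implementation.

```python
import random

def mutate_weights(weights, rng, mutate_rate=0.8, replace_rate=0.1,
                   sigma=0.5, weight_range=2.0):
    """Return a mutated copy of a genome's connection weights.

    Each weight is left alone with probability 1 - mutate_rate.
    Otherwise it is either replaced by a brand new random value or
    perturbed by normally distributed noise.
    """
    mutated = []
    for w in weights:
        if rng.random() < mutate_rate:
            if rng.random() < replace_rate:
                w = rng.uniform(-weight_range, weight_range)  # brand new value
            else:
                w = w + rng.gauss(0.0, sigma)                 # small perturbation
        mutated.append(w)
    return mutated

rng = random.Random(0)
w = [0.1, -0.4, 0.9]
print(mutate_weights(w, rng))
```

The rates are typically tuned per problem, and some implementations also clamp the result to a fixed range.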
Implementing NEAT might seem an easy task, but there are a lot of small details that make it fairly complicated in the end. You might want to look at existing implementations and use one of them or at least be inspired by them. Everything important can be found at the NEAT Users Page.