How is Growing Neural Gas used for clustering? - machine-learning

I know how the algorithm works, but I'm not sure how it determines the clusters. Based on images I guess that it sees all the neurons that are connected by edges as one cluster. So that you might have two clusters of two groups of neurons each all connected. But is that really it?
I also wonder.. is GNG really a neural network? It doesn't have a propagation function or an activation function or weighted edges.. isn't it just a graph? I guess that depends on personal opinion a bit but I would like to hear them.
UPDATE:
This thesis www.booru.net/download/MasterThesisProj.pdf deals with GNG-clustering and on page 11 you can see an example of what looks like clusters of connected neurons. But then I'm also confused by the number of iterations. Let's say I have 500 data points to cluster. Once I put them all in, do I remove them and add them again to adapt the existing network? And how often do I do that?
I mean.. I have to re-add them at some point.. when adding a new neuron r, between two old neurons u and v then some data points formerly belonging to u should now belong to r because it's closer. But the algorithm does not contain changing the assignment of these data points. And even if I remove them after one iteration and add them all again, then the false assignment of the points for the rest of that first iteration changes the processing of the network doesn't it?
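To make the "connected neurons form one cluster" reading concrete, here is a minimal sketch. It assumes the trained network is available as a list of neuron positions plus a list of edges (both hypothetical names and made-up values). It also assumes that no data-point assignment is stored at all: a point's cluster is looked up on demand through its nearest neuron, so inserting a new neuron never leaves a stale assignment behind.

```python
from collections import defaultdict
import math

def connected_components(num_nodes, edges):
    """Label every neuron with the index of its connected component."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    labels = [-1] * num_nodes
    current = 0
    for start in range(num_nodes):
        if labels[start] != -1:
            continue  # already reached from an earlier neuron
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in neighbors[node]:
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels

def nearest_node(point, positions):
    """Index of the neuron closest to `point` (Euclidean distance)."""
    return min(range(len(positions)), key=lambda i: math.dist(point, positions[i]))

def cluster_of(point, positions, labels):
    """Cluster of a data point = component of its nearest neuron."""
    return labels[nearest_node(point, positions)]

# Hypothetical trained network: two groups of neurons, no edge between the groups.
positions = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0), (10.0, 10.0), (11.0, 10.0)]
edges = [(0, 1), (1, 2), (3, 4)]
labels = connected_components(len(positions), edges)
```

With this layout, `cluster_of((0.4, 0.2), positions, labels)` and `cluster_of((10.5, 9.0), positions, labels)` land in different components. Whether such components are meaningful clusters is a separate question.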

NG and GNG are a form of self-organizing maps (SOM), which are also referred to as "Kohonen neural networks".
These are based on an older, much wider view of neural networks, from when they were still inspired by nature rather than being driven by GPU capabilities for matrix operations. Back then, before massive-SIMD architectures existed, there was nothing bad about having neurons self-organize rather than being preorganized in strict layers.
I would not call them clustering, although that term is commonly (ab-)used in related work, because I don't see any strong property of these "clusters".
SOMs are literally maps as in geography. A SOM is a set of nodes ("neurons") usually arranged in a 2d rectangular or hexagonal grid (= the map). The positions in the input space are then optimized iteratively to fit the data. Because they influence their neighbors, they cannot move freely. Think of wrapping a net around a tree; the knots of the net are your neurons. NG and GNG appear to be pretty much the same thing, but with a more flexible structure of nodes. But actually a nice property of SOMs is the 2d map that you can get.
The only approach I remember for clustering was to project the input data to the discrete 2d space of the SOM grid, then run k-means on this projection. It will probably work okayish (as in: it will perform similar to k-means), but I'm not convinced that it's theoretically well supported.

Related

Neural Network - Should I Remove All Derived / Calculated Variables?

I'm using a neural network to control the movement of a character in a game. I currently have a huge number of dimensions, and in the interest of trimming them to improve storage and code manageability, I'm considering removing all derived variables, i.e. any variable which can be calculated from data already sent into the network.
An example of this would be the relationship between a) position, b) velocity, and c) acceleration along a path. Currently, I send the last 50 data points of all three to the NN to help it decide its next movement. However, I wonder if system control / error could be minimized just as easily by sending only position. Theoretically the neural network should be able to derive the velocity and acceleration at a point in time entirely on its own given the position history.
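To illustrate why velocity and acceleration count as derived variables here: with evenly spaced samples they are just finite differences of the position history. A small sketch, where the sample values and the time step `dt` are made up for illustration:

```python
def finite_differences(values, dt=1.0):
    """Differences between consecutive samples, divided by the time step."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

# Hypothetical position history: x(t) = t^2 sampled at t = 0, 1, 2, 3, 4.
positions = [0.0, 1.0, 4.0, 9.0, 16.0]
velocities = finite_differences(positions)       # one sample shorter than positions
accelerations = finite_differences(velocities)   # two samples shorter than positions
```

Both derived series are fixed linear combinations of the positions, which is the redundancy the question is about.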
Generally, is dimension reduction in this capacity recommended? Why or why not?
I know the usual recommendation in this scenario is just to test it and see what happens, but in this case there are so many variables that it would take days to test, so I was hoping to hear anyone's experience with this type of situation and what they surmise the general rule to be.
Bonus question: would this assessment / decision be different for a neural network (intent on mapping functions to data) as opposed to a random forest (which seems to use more of a nearest neighbor approach)?
Thanks!!
Implement PCA to reduce the number of features. The reduced features will have unusual units like [positionvelocityacceleration]. However, if you do PCA correctly, you can retain a feature set that keeps 99% of the variance of the original set.
Then use the new feature set in your NN.
Reducing dimensions is recommended to speed-up algorithms because, as you observed, there is a lot of similarity between your features.

an algorithm for clustering visually separable clusters

I have visualized a dataset in 2D after employing PCA. The X dimension is time and the Y dimension is the first PCA component. As the figure shows, there is relatively good separation between the points (A, B). Unfortunately, clustering methods (DBSCAN, SMO, KMEANS, Hierarchical) are not able to cluster these points into 2 clusters. As you can see, section A is relatively continuous; then this continuous process finishes and section B starts, with a rather big gap between A and B compared to the past data.
I would be very grateful if you could point me to any method or algorithm (or a metric devised from the data and its distribution) that can separate A and B without visualization. Thank you so much.
This is a plot of 2 PCA components for the above plot (the first one). The other one is a plot of the components of another dataset, for which I also get bad results.
This is a time series, and apparently you are looking for change points or want to segment this time series.
Do not treat this data set as a two dimensional x-y data set, and don't use clustering here; rather choose an algorithm that is actually designed for time series.
As a starter, plot series[x] - series[x-1], i.e. the first difference (a discrete first derivative). You may need to remove seasonality to improve results. No clustering algorithm will do this; they do not have a notion of seasonality or time.
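A rough sketch of that first step, using made-up values rather than the data from the question: compute series[x] - series[x-1] and split at the largest jump. Taking the single largest jump is a simplifying assumption for illustration, not a proper change point detector, and it does nothing about seasonality.

```python
def first_difference(series):
    """series[x] - series[x-1] for every x >= 1."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]

def largest_jump(series):
    """Index x at which |series[x] - series[x-1]| is largest."""
    diffs = first_difference(series)
    best = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return best + 1  # diffs[i] belongs to position i + 1 of the series

# Hypothetical first-component values ordered by time.
series = [0.10, 0.12, 0.11, 0.13, 0.45, 0.47, 0.46]
split = largest_jump(series)
segment_a, segment_b = series[:split], series[split:]
```

Here the split falls where the values jump from about 0.13 to 0.45, which is the kind of gap described between A and B.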
If PCA gives you a good separation, you can just try to cluster after projecting your data through your PCA eigenvectors. If you don't want to use PCA, then you will need an alternative data projection method anyway, because failing clustering methods imply that your data is not separable in the original dimensions. You can take a look at non-linear clustering methods such as kernel-based ones or spectral clustering, for example. Or you can define your own non-Euclidean metric, which is in fact just another data projection method.
But using PCA clearly seems to be the best fit in your case (Occam's razor: use the simplest model that fits your data).
I don't know that you'll have an easy time devising an algorithm to handle this case, which is dangerously (by present capabilities) close to "read my mind" clustering. You have a significant alley where you've marked the division. You have one nearly as good around (1700, +1/3), and an isolate near (1850, 0.45). These will make it hard to convince a general-use algorithm to make exactly one division at the spot you want, although that one is (I think) still the most computationally obvious.
Spectral clustering works well at finding gaps; I'd try that first. You might have to ask it for 3 or 4 clusters to separate the one you want in general. You could also try playing with SVM (good at finding alleys in data), but doing that in an unsupervised context is the tricky part.
No, KMeans is not going to work; it isn't sensitive to density or connectivity.

How to evolve weights of a neural network in Neuroevolution?

I'm new to Artificial Neural Networks and NeuroEvolution algorithms in general. I'm trying to implement the algorithm called NEAT (NeuroEvolution of Augmenting Topologies), but the description in the original published paper leaves out how to evolve the weights of a network; it says
Connection weights mutate as in any NE system, with each connection either perturbed or not at each generation
I've done some searching about how to mutate weights in NE systems, but can't find any detailed description, unfortunately.
I know that while training a neural network, usually the backpropagation algorithm is used to correct the weights, but it only works if you have a fixed topology (structure) through generations and you know the answer to the problem. In NeuroEvolution, you don't know the answer, you have only the fitness function, so it's not possible to use backpropagation here.
I have some experience with training a fixed-topology NN using a genetic algorithm (What the paper refers to as the "traditional NE approach"). There are several different mutation and reproduction operators we used for this and we selected those randomly.
Given two parents, our reproduction operators (could also call these crossover operators) included:
Swap either single weights or all weights for a given neuron in the network. So for example, given two parents selected for reproduction either choose a particular weight in the network and swap the value (for our swaps we produced two offspring and then chose the one with the best fitness to survive in the next generation of the population), or choose a particular neuron in the network and swap all the weights for that neuron to produce two offspring.
Swap an entire layer's weights. So given parents A and B, choose a particular layer (the same layer in both) and swap all the weights between them to produce two offspring. This is a large move, so we set it up so that this operation would be selected less often than the others. Also, this may not make sense if your network only has a few layers.
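A minimal sketch of these swap operators, assuming both parents share the same fixed topology and are stored as nested lists indexed as `net[layer][neuron][weight]`. That representation and the function names are assumptions made for the example, not what we actually used.

```python
import copy
import random

def swap_single_weight(parent_a, parent_b, rng=random):
    """Swap one randomly chosen weight between copies of the two parents."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    layer = rng.randrange(len(child_a))
    neuron = rng.randrange(len(child_a[layer]))
    weight = rng.randrange(len(child_a[layer][neuron]))
    child_a[layer][neuron][weight], child_b[layer][neuron][weight] = (
        child_b[layer][neuron][weight],
        child_a[layer][neuron][weight],
    )
    return child_a, child_b

def swap_neuron(parent_a, parent_b, rng=random):
    """Swap all weights of one randomly chosen neuron."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    layer = rng.randrange(len(child_a))
    neuron = rng.randrange(len(child_a[layer]))
    child_a[layer][neuron], child_b[layer][neuron] = (
        child_b[layer][neuron],
        child_a[layer][neuron],
    )
    return child_a, child_b

def swap_layer(parent_a, parent_b, rng=random):
    """Swap all weights of one randomly chosen layer (the same layer in both)."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    layer = rng.randrange(len(child_a))
    child_a[layer], child_b[layer] = child_b[layer], child_a[layer]
    return child_a, child_b
```

Each function returns two offspring and leaves the parents untouched, so the fitter of the two can then be chosen as described above.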
Our mutation operators operated on a single network and would select a random weight and either:
completely replace it with a new random value
change the weight by some percentage (multiply the weight by some random number between 0 and 2; practically speaking we would tend to constrain that a bit and multiply it by a random number between 0.5 and 1.5). This has the effect of scaling the weight so that it doesn't change as radically. You could also do this kind of operation by scaling all the weights of a particular neuron.
add or subtract a random number between 0 and 1 to/from the weight.
Change the sign of a weight.
swap weights on a single neuron.
You can certainly get creative with mutation operators, you may discover something that works better for your particular problem.
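The mutation operators listed above could look roughly like this. It is a sketch that assumes one neuron's weights are a flat list of floats; the range used for a brand new random value (-1 to 1) is my assumption, since the right range depends on your network.

```python
import random

def replace_weight(w, rng):
    """Completely replace the weight (the range -1..1 is an assumption)."""
    return rng.uniform(-1.0, 1.0)

def scale_weight(w, rng):
    """Multiply by a random factor between 0.5 and 1.5."""
    return w * rng.uniform(0.5, 1.5)

def shift_weight(w, rng):
    """Add or subtract a random number between 0 and 1."""
    return w + rng.uniform(-1.0, 1.0)

def flip_sign(w, rng):
    """Change the sign of the weight."""
    return -w

WEIGHT_OPERATORS = [replace_weight, scale_weight, shift_weight, flip_sign]

def mutate(weights, rng=random):
    """Copy `weights` and apply one random operator to one random weight."""
    mutated = list(weights)
    index = rng.randrange(len(mutated))
    operator = rng.choice(WEIGHT_OPERATORS)
    mutated[index] = operator(mutated[index], rng)
    return mutated

def swap_two_weights(weights, rng=random):
    """Swap two randomly chosen weights on a single neuron."""
    mutated = list(weights)
    i, j = rng.sample(range(len(mutated)), 2)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Picking the operator uniformly at random is only one option; weighting the choice so that drastic operators fire less often is an easy variation.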
IIRC, we would choose two parents from the population based on random proportional selection, then ran mutation operations on each of them and then ran these mutated parents through the reproduction operation and ran the two offspring through the fitness function to select the fittest one to go into the next generation population.
Of course, in your case since you're also evolving the topology some of these reproduction operations above won't make much sense because two selected parents could have completely different topologies. In NEAT (as I understand it) you can have connections between non-contiguous layers of the network, so for example you can have a layer 1 neuron feed another in layer 4, instead of feeding directly to layer 2. That makes swapping operations involving all the weights of a neuron more difficult - you could try to choose two neurons in the network that have the same number of weights, or just stick to swapping single weights in the network.
I know that while training a NE, usually the backpropagation algorithm is used to correct the weights
Actually, in NE backprop isn't used. It's the mutations performed by the GA that are training the network as an alternative to backprop. In our case backprop was problematic due to some "unorthodox" additions to the network which I won't go into. However, if backprop had been possible, I would have gone with that. The genetic approach to training NNs definitely seems to proceed much more slowly than backprop probably would have. Also, when using an evolutionary method for adjusting weights of the network, you start needing to tweak various parameters of the GA like crossover and mutation rates.
In NEAT, everything is done through the genetic operators. As you already know, the topology is evolved through crossover and mutation events.
The weights are evolved through mutation events. Like in any evolutionary algorithm, there is some probability that a weight is changed randomly (you can either generate a brand new number or you can e.g. add a normally distributed random number to the original weight).
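As a sketch of "each connection either perturbed or not": walk over the connection weights, decide per connection whether to mutate it, and then either draw a brand new value or add Gaussian noise. All probabilities, the noise scale and the replacement range below are illustrative assumptions, not values taken from the NEAT paper.

```python
import random

def mutate_connection_weights(weights, mutate_prob=0.8, replace_prob=0.1,
                              sigma=0.5, rng=random):
    """Return a mutated copy of a list of connection weights.

    mutate_prob, replace_prob and sigma are assumed example values.
    """
    new_weights = []
    for w in weights:
        if rng.random() >= mutate_prob:
            new_weights.append(w)                          # connection left alone
        elif rng.random() < replace_prob:
            new_weights.append(rng.uniform(-1.0, 1.0))     # brand new number
        else:
            new_weights.append(w + rng.gauss(0.0, sigma))  # perturb the old weight
    return new_weights
```

Because this only touches weights and never the structure, it works the same way no matter how different the topologies in the population are.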
Implementing NEAT might seem an easy task, but there are a lot of small details that make it fairly complicated in the end. You might want to look at existing implementations and use one of them or at least be inspired by them. Everything important can be found at the NEAT Users Page.

What significance does an activation pattern hold for SOMs?

SOM (Self-Organizing Map): every input dimension maps to all output nodes, and the nodes compete with each other for scoring (vector quantization). PCA and other clustering methods can be seen as simplified special cases of this process.
There is only ever a single winning node in a SOM. However, what happens when an input strongly resembles two established 'clusters'? Could it so happen that the first neuron wins over a second neuron by a small margin and yet the two are very far apart? If so, would it not also be extremely useful information?
If so, then it means the entire activation pattern with all its various outputs would be useful in classifying an input.
The reason I'm asking is because I'm considering plugging SOMs into other neural networks and then maybe back again into SOMs. And when plugging in, I wish to know if it would be safe to just carry over the entire lattice with all its outputs instead of just the winning node.
I have tried checking the math of the SOM, when training it only considers the winning neuron, but nothing seems to indicate that if a new input is used, only the winning node is of importance to the operator.
The goal of the algorithm at the end of training is to have the first and second winning nodes of each input pattern in adjacent positions in the lattice. This is referred to as Topology Preservation of the input data space. The inverse case is considered bad training and is measured by the topological error. One simple measure of this error is the ratio of input vectors for which the first and second winning nodes are not adjacent.
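That simple measure can be sketched as follows, assuming the trained map is stored as a dict from grid coordinates (row, col) to weight vectors and that "adjacent" means the 4-neighbourhood on a rectangular grid. Both are assumptions for the example; a hexagonal lattice would need a different adjacency test.

```python
import math

def two_best_units(x, weights):
    """Grid coordinates of the closest and second-closest nodes to input x."""
    ranked = sorted(weights, key=lambda coord: math.dist(x, weights[coord]))
    return ranked[0], ranked[1]

def are_adjacent(a, b):
    """True if two grid coordinates are direct neighbours (4-neighbourhood)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

def topological_error(data, weights):
    """Fraction of inputs whose first and second winners are not adjacent."""
    bad = sum(1 for x in data if not are_adjacent(*two_best_units(x, weights)))
    return bad / len(data)

# Hypothetical 1x3 map that is "twisted": the middle node sits far from the others.
weights = {(0, 0): (0.0,), (0, 1): (10.0,), (0, 2): (1.0,)}
data = [(0.2,), (9.0,)]
error = topological_error(data, weights)
```

For the input 0.2 the two winners are the first and third nodes, which are not neighbours on the grid, so that input counts towards the error. This is exactly the "small margin, yet far apart on the map" situation the question asks about.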
Search for SOM and topology preservation.
Keep in mind that small maps generally produce a smaller topological error but an increased quantization error, whereas larger maps tend to reverse this situation. So there is a trade-off between topology preservation and quantization accuracy. There isn't a golden rule for this. It always depends on the domain, the application and the expected results.

Number of hidden layers in a neural network model

Would someone be able to explain to me or point me to some resources of why (or situations where) more than one hidden layer would be necessary or useful in a neural network?
Basically more layers allow more functions to be represented. The standard book for AI courses, "Artificial Intelligence, A Modern Approach" by Russell and Norvig, goes into some detail of why multiple layers matter in Chapter 20.
One important point is that with a sufficiently large single hidden layer, you can represent every continuous function, but you will need at least 2 layers to be able to represent every discontinuous function.
In practice, though, a single layer is enough at least 99% of the time.
That's more similar to the way the brain works (which might not necessarily be a computational advantage, but a lot of people are researching NNs to gain insight into the way the mind works rather than to solve real-world problems).
It's easier to achieve some kinds of invariance using more layers. For example, an image classifier that works regardless of where in the image the object is found, or of the object's size. See Bouvrie, J., L. Rosasco, and T. Poggio. "On Invariance in Hierarchical Models". Advances in Neural Information Processing Systems (NIPS) 22, 2009.
Each layer effectively raises the potential "complexity" of adaptation in an exponential fashion (as opposed to a multiplicative fashion of adding more nodes to a single layer).
