learning estimated value AND expected temporal-difference error - machine-learning

How could I best let my network learn not only the expected value but also the expected variation around that value, as a measure of uncertainty? For any state the network has never seen before, this would be very high; for any state the network has seen many times, it should approach some estimate of the expected variation.
I am wondering whether one can "learn" both aspects at the same time with a (potentially partially) overlapping network.

Related

Regarding the consistency of convolutional neural networks

I am currently building a 2-channel (also called double-channel) convolutional neural network in order to classify 2 binary images (containing binary objects) as 'similar' or 'different'.
The problem I am having is that it seems as though the network doesn't always converge to the same solution. For example, I can use exactly the same ordering of training pairs and all the same parameters and so forth, and when I run the network multiple times, each time produces a different solution; sometimes converging to below 2% error rates, and other times I get 50% error rates.
I have a feeling that it has something to do with the random initialization of the weights of the network, which results in different optimization paths each time the network is executed. This issue even occurs when I use SGD with momentum, so I don't really know how to 'force' the network to converge to the same solution (global optima) every time?
Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?
There are several sources of randomness in training.
Initialization is one. SGD itself is of course stochastic since the content of the minibatches is often random. Sometimes, layers like dropout are inherently random too. The only way to ensure getting identical results is to fix the random seed for all of them.
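As a minimal sketch of what fixing the seed buys you, the toy example below trains a one-parameter model with SGD, where both the initial weight and the sample order come from a single seeded generator. The function `train`, the toy data, and the learning rate are assumptions made for illustration; a real framework usually has its own generators (for initialization, shuffling, dropout) that each need seeding as well.

```python
import random

def train(seed, epochs=5, lr=0.05):
    """Fit y = w * x with SGD; every random choice comes from one seeded generator."""
    rng = random.Random(seed)
    data = [(0.1 * i, 0.2 * i) for i in range(1, 11)]  # toy targets, y = 2x
    w = rng.uniform(-1.0, 1.0)                         # random initialization
    for _ in range(epochs):
        rng.shuffle(data)                              # random sample order, as in SGD
        for x, y in data:
            w -= lr * 2.0 * (w * x - y) * x            # gradient step on squared error
    return w

run_a = train(seed=0)
run_b = train(seed=0)   # same seed: same initialization, same sample order
run_c = train(seed=1)   # different seed: a different optimization path

print(run_a == run_b, run_a == run_c)
```

Runs with the same seed are bit-for-bit identical here because nothing else in the toy loop is random; on GPUs, non-deterministic kernels can still introduce small differences even with fixed seeds.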
Given all these sources of randomness and a model with many millions of parameters, your quote
"I don't really know how to 'force' the network to converge to the same solution (global optima) every time?"
is pretty much something anyone could say: no one knows how to find the same solution every time, or even a local optimum, let alone the global optimum.
Nevertheless, ideally, it is desirable to have the network perform similarly across training attempts (with fixed hyper-parameters and dataset). Anything else is going to cause problems in reproducibility, of course.
Unfortunately, I suspect the problem is inherent to CNNs.
You may be aware of the bias-variance tradeoff. For a powerful model like a CNN, the bias is likely to be low, but the variance very high. In other words, CNNs are sensitive to data noise, initialization, and hyper-parameters. Hence, it's not so surprising that training the same model multiple times yields very different results. (I also get this phenomenon, with performances changing between training runs by as much as 30% in one project I did.) My main suggestion to reduce this is stronger regularization.
"Can this have something to do with the fact that I am using binary images instead of grey-scale or color images, or is there something intrinsic to neural networks that is causing this issue?"
As I mentioned, this problem is inherent to deep models to an extent. However, your use of binary images may also be a factor, since the space of the data itself is rather discontinuous. Perhaps consider "softening" the input (e.g. filtering the inputs) and using data augmentation. A similar softening idea is known to help on the label side, in the form of label smoothing, for example.
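One possible reading of "softening" is a small averaging filter applied to each binary image before it is fed to the network. The sketch below uses a 3x3 box filter written in plain Python; the function name `box_blur` and the 4x4 example image are assumptions for illustration, and a Gaussian filter from an image library would serve the same purpose.

```python
def box_blur(image):
    """Replace each pixel by the mean of its in-bounds 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            out[r][c] = sum(neighbours) / len(neighbours)
    return out

# Hypothetical 4x4 binary image containing one square object.
binary = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
soft = box_blur(binary)
for row in soft:
    print([round(v, 2) for v in row])
```

The filtered image takes intermediate values between 0 and 1 near object edges, so small shifts of an object change the input gradually instead of flipping pixels.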

Accurate general description of Regression versus Classification

So I have the following problem: I realized (while writing my master thesis) that I am still not sure/have vague descriptions of some of the machine learning principles.
For instance, I vaguely remember that at some point I heard the following description:
The output (label) of a classification task is discrete and finite while the output (label) of a regression task is continuous and can be infinite
The one word that I am unsure of is infinite for regression in this description.
For instance, assume that (for whatever reason) you have 2D data points that are distributed almost like a sine wave (with some noise), and you use polyfit to fit a polynomial of degree k to them (see the figure here; here k = 8). Now you have some data in a specific range, e.g., here the range of available points in the x-direction is [0,12], which is used to fit the polynomial.
However wouldn't you be able to quickly get the y-result for the value x = 1M (or an arbitrarily large number), as you have the general shape of the polynomial? Is that not what infinite labels mean?
Maybe I am just wrongly remembering stuff that I learned years ago ;).
best regards
First of all, this is a question more fitting for the more theoretically inclined sites of StackExchange, like Stats Stackexchange, Math Stackexchange, or the Data Science Stackexchange, which conveniently also provide answers to your question.
But not quite. In any case, your problem seems to be with the distinction between input and output. The type of task (i.e. either classification or regression) is based solely on the output of your model, and has nothing to do with the input.
You could have a ton of "continuous input variables" (or even a mixture with discrete ones), and still call it a classification task, as long as it has a discrete, finite set of output values.
Furthermore, the infinite simply refers to the fact that these values are not bounded, i.e. you cannot restrict your regression task to a specific range easily. If you suddenly input a value completely outside of your training value range (as with your example), you will likely get an "infinite" y value, since your network will only be trained on this specific range; a problem that also happens with polynomial fitting, as the following example shows:
The red line could be the learned function for your network, so if you suddenly go far beyond known values, you likely get some extreme value (unless you train very well).
As opposed to that, the classification network would still predict one of the given classes. I like to imagine it as a kind of Voronoi diagram: even if your point is arbitrarily far from any of the previous points, it will still belong to some category.
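The contrast between unbounded regression outputs and a fixed set of class labels can be sketched in a few lines. Note the substitution: instead of a least-squares polyfit, this example uses exact Lagrange interpolation through 9 noise-free sine samples on [0, 12] (which also yields a degree-8 polynomial), purely to stay within the standard library. The class names and centroids are hypothetical.

```python
import math

def interpolate(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def nearest_class(centroids, point):
    """Return the label whose centroid is closest to point (a Voronoi-style rule)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], point))
    return min(centroids, key=sq_dist)

# Regression: 9 samples of sin(x) on [0, 12] define a degree-8 polynomial.
xs = [1.5 * i for i in range(9)]
ys = [math.sin(x) for x in xs]
inside = interpolate(xs, ys, 6.7)    # within the sampled range
outside = interpolate(xs, ys, 1e6)   # far outside the sampled range

# Classification: a point arbitrarily far away still gets one of the known labels.
centroids = {"class_a": (0.0, 0.0), "class_b": (5.0, 5.0)}
far_label = nearest_class(centroids, (1e6, -1e6))

print(inside, math.sin(6.7))
print(outside)
print(far_label)
```

Inside the range the polynomial tracks the sine closely; at x = 1e6 its value is astronomically large in magnitude, while the classifier's answer remains one of the two labels.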

Purposely Overfit Neural Network

Technically speaking, given a complex enough network and sufficient amounts of time, is it always possible to overfit any dataset to the point where training error is 0?
Neural networks are universal approximators, which pretty much means that as long as there exists a deterministic mapping f from input to output, there always exists a set of parameters (for a large enough network) that gives you an error arbitrarily close to the minimal possible error, but:
if the dataset is infinite (it is a distribution), then the minimal obtainable error (called the Bayes risk) can be greater than zero: some value e (pretty much a measure of the "overlap" of different classes/values).
if the mapping f is non-deterministic, then again there is a non-zero Bayes risk e (this is a mathematical way of saying that a given point can have "multiple" values, with given probabilities).
arbitrarily close does not mean minimal. So even if the minimal error is zero, it does not mean that you just need a "big enough" network to get to zero; you might always end up with a very small epsilon (which you can keep decreasing for as long as you want). For example, a network trained on a classification task with a sigmoid/softmax output can never attain the minimal log loss (cross-entropy loss), as you can always move your activations "closer to 1" or "closer to 0", but you can never reach either of them.
So from a mathematical perspective the answer is no; from a practical point of view - under the assumption of a finite training set and a deterministic mapping - the answer is yes.
In particular, when you are asking about the accuracy of the classification, and you have a finite dataset with a unique label per datapoint, then it is easy to construct by hand a neural network which has 100% accuracy. However, this does not mean minimal possible loss (as described above). Thus, from the optimization perspective, you are not obtaining "zero error".
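The point about the sigmoid output never reaching zero log loss can be checked numerically. The sketch below assumes a single example with target 1 and writes the loss as log(1 + exp(-z)) for logit z, which equals -log(sigmoid(z)); the function name and the chosen logits are illustrative.

```python
import math

def log_loss_for_positive(z):
    """Cross-entropy for target 1 and logit z: -log(sigmoid(z)) = log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))

logits = [1.0, 5.0, 10.0, 20.0, 30.0]
losses = [log_loss_for_positive(z) for z in logits]

for z, loss in zip(logits, losses):
    print(z, loss)
```

The loss keeps shrinking as the logit grows but stays strictly positive for every finite logit, even though the prediction has been "correct" (above 0.5) all along.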

What significance does an activation pattern hold for SOMs?

SOM - Self-Organizing Map: every input dimension maps to all output nodes, and the nodes compete with each other for scoring - vector quantization. PCA and other clustering methods can be seen as simplified special cases of this process.
There is only ever a single winning node in a SOM. However, what happens when an input strongly resembles two established 'clusters'? Could it so happen that the first neuron wins over a second neuron by a small margin and yet the two are very far apart? If so, would it not also be extremely useful information?
If so, then it means the entire activation pattern with all its various outputs would be useful in classifying an input.
The reason I'm asking is because I'm considering plugging SOMs into other neural networks and then maybe back again into SOMs. And when plugging in, I wish to know if it would be safe to just carry over the entire lattice with all its outputs instead of just the winning node.
I have tried checking the math of the SOM: when training, it only considers the winning neuron, but nothing seems to indicate that, when a new input is presented, only the winning node is of importance to the operator.
The goal of the algorithm at the end of training is to have the first and second winning nodes of each input pattern in adjacent positions in the lattice. This is referred to as Topology Preservation of the input data space. The inverse case is considered bad training and is measured by the topological error. One simple measure of this error is the ratio of input vectors for which the first and second winning nodes are not adjacent.
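The simple error measure described above can be sketched as follows. The assumptions here are a rectangular lattice stored as a dict from (row, column) to weight vector, squared Euclidean distance for matching, and "adjacent" meaning the 8-neighbourhood; the function names and the two tiny 1x4 maps are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_best_units(weights, x):
    """Return the lattice positions of the first and second winning nodes for x."""
    ranked = sorted(weights, key=lambda pos: sq_dist(weights[pos], x))
    return ranked[0], ranked[1]

def topological_error(weights, inputs):
    """Fraction of inputs whose first and second winning nodes are not adjacent."""
    errors = 0
    for x in inputs:
        (r1, c1), (r2, c2) = two_best_units(weights, x)
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:  # not in the 8-neighbourhood
            errors += 1
    return errors / len(inputs)

inputs = [[0.4], [1.6], [2.9]]

# A well-ordered 1x4 map: neighbouring nodes hold neighbouring values.
ordered = {(0, 0): [0.0], (0, 1): [1.0], (0, 2): [2.0], (0, 3): [3.0]}
# A "twisted" map: the two middle nodes are swapped.
twisted = {(0, 0): [0.0], (0, 1): [2.0], (0, 2): [1.0], (0, 3): [3.0]}

err_ordered = topological_error(ordered, inputs)
err_twisted = topological_error(twisted, inputs)
print(err_ordered, err_twisted)
```

The same `two_best_units` ranking also shows how close the runner-up was, which is the kind of information the question suggests passing on instead of only the single winning node.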
Search for SOM and topology preservation.
Here is a quick link .
Keep in mind that small maps generally produce a smaller topological error but an increased quantization error, whereas larger maps tend to reverse this situation. So there is a trade-off between topology preservation and quantization accuracy. There isn't a golden rule for this. It always depends on the domain, the application and the expected results.

Neural Network Outputs Are Not Changing Very Much

I have 20 output neurons on a feed-forward neural network, for which I have already tried varying the number of hidden layers and number of neurons per hidden layer. When testing, I've noticed that while the outputs are not always exactly the same, they vary from test case to case very little, especially in respect to one another. It seems to be outputting nearly (within 0.0005 depending on the initial weights) the same output on every test case; the one that is the highest is always the highest. Is there a reason for this?
Note: I'm using a feed-forward neural network, with resilient and common backpropagation, separating training/validation/testing and shuffling in between training sets.
UPDATE: I'm using the network to categorize patterns from 4 inputs into one of twenty output possibilities. I have 5000 training sets, 800 validation sets, and 1500 testing sets. Number of rounds can vary depending on what I'm doing, on my current training case, the training error seems to converge too quickly (under 20 epochs). However, I have noticed this non-variance at other times when the error will decrease over a period of 1000 epochs. I have also adjusted the learning rate and momentum for the regular propagation. Resilient propagation does not use a learning rate or momentum for updates. This is being implemented using Encog.
Your dataset seems problematic to begin with. 20 outputs for 4 inputs seems like too many. The number of outputs is generally much smaller than the number of inputs. Most probably, either the dataset is wrongly formulated, or you have misunderstood something in the problem you are trying to solve. Anyway, some things regarding your other comments:
First of all, you don't use 5000 training sets, but one set with 5000 training patterns. The same goes for validation and testing.
Second, the output can't be exactly the same on each run, since the weights are initialized randomly and the outputs depend on them. However, we want them to be similar on each run. If they weren't it would mean that they depend too much on the random initialization, so the network wouldn't work well.
In your case, the highest output is the selected category, so if the same output is the highest every time your network is working well.
If the network output is almost the same for different input patterns, the network is unable to categorize input well.
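A quick way to check whether the outputs really are nearly constant across inputs is to measure, per output neuron, how much its activation varies over the test patterns, and to count how often each neuron is the highest. The sketch below uses made-up activations for 3 patterns and 3 output neurons; the function names and numbers are illustrative and not tied to Encog.

```python
from collections import Counter

def output_spread(outputs):
    """Per output neuron: max minus min of its activation across all input patterns."""
    n_out = len(outputs[0])
    return [
        max(row[k] for row in outputs) - min(row[k] for row in outputs)
        for k in range(n_out)
    ]

def predicted_class_counts(outputs):
    """Count how often each output neuron has the highest activation."""
    return Counter(max(range(len(row)), key=row.__getitem__) for row in outputs)

# Hypothetical network outputs, one row per test pattern.
outputs = [
    [0.1001, 0.7002, 0.2000],
    [0.1003, 0.7000, 0.2001],
    [0.1002, 0.7001, 0.2002],
]
spread = output_spread(outputs)
counts = predicted_class_counts(outputs)
print(spread)
print(counts)
```

A tiny spread on every neuron together with a single predicted class for all patterns suggests the network is ignoring its inputs; comparing `counts` with the class frequencies in the training data can show whether it has simply learned the most common class.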
You say your network has 4 input nodes and 20 output nodes (right?). So, if the inputs are binary, there are only 2*2*2*2 = 16 different possible input patterns. Why the hell would you need 800 validation patterns?
Your training data may be corrupt.

Resources