How to propagate uncertainty into the prediction of a neural network? - machine-learning

I have inputs x_1, ..., x_n that have known 1-sigma uncertainties e_1, ..., e_n. I am using them to predict outputs y_1, ..., y_m on a trained neural network. How can I obtain 1-sigma uncertainties on my predictions?
My idea is to randomly perturb each input x_i with normal noise having mean 0 and standard deviation e_i a large number of times (say, 10000), and then take the median and standard deviation of each prediction y_i. Does this work?
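For concreteness, this is roughly what I mean (predict stands for the trained network's prediction function; the names are just placeholders):
    import numpy as np

    def mc_input_uncertainty(predict, x, e, n_samples=10000):
        # perturb the inputs with their 1-sigma noise and summarize the predictions
        rng = np.random.default_rng()
        perturbed = x + rng.normal(0.0, e, size=(n_samples, len(x)))
        preds = np.array([predict(p) for p in perturbed])   # shape (n_samples, m)
        return np.median(preds, axis=0), np.std(preds, axis=0)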
I fear that this only takes into account the "random" error (from the measurements) and not the "systematic" error (from the network), i.e., each prediction inherently has some error to it that is not being considered in this approach. How can I properly obtain 1-sigma error bars on my predictions?

You can find a general analysis of what "jittering" (generating random samples) brings to neural network optimization here: http://wojciechczarnecki.com/pdfs/preprint-ml-with-unc.pdf
In short, jittering acts as a regularizer on the network's weights.
For error bars as such, you should refer to the work of Will Penny:
http://www.fil.ion.ucl.ac.uk/~wpenny/publications/error_bars.ps
http://www.fil.ion.ucl.ac.uk/~wpenny/publications/nnerrors.ps

You are right. That method only takes the data uncertainty into account (assuming you don't refit the neural net while applying the noise). As a side note, when fitting the data with a neural net you may alternatively use mixture density networks (see one of the many tutorials).
More importantly, in order to account for model uncertainty you should use Bayesian neural networks. You could start, e.g., with Monte Carlo dropout. Also very interesting should be this work on performing sampling-free inference when using Monte Carlo dropout:
https://arxiv.org/abs/1908.00598
This work explicitly uses error propagation through neural networks and should be very interesting for you!
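As a rough illustration of the Monte Carlo dropout idea (not the sampling-free method of the linked paper), assuming a trained Keras model that contains Dropout layers, you can keep dropout active at prediction time and look at the spread of the samples:
    import numpy as np

    def mc_dropout_predict(model, x, n_samples=100):
        # training=True keeps the Dropout layers active at inference time,
        # so repeated forward passes give a crude sample of the model uncertainty
        preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
        return preds.mean(axis=0), preds.std(axis=0)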
Best

Related

How does training a neural network using a metaheuristic like PSO work?

I want to learn a bit more about other approaches to training neural networks and I can find a fair bit of literature on GA training a network but not much on PSO training. How does that work?
I have a general idea: you create a swarm of some number of particles and use the network loss function (e.g. MSE) as a fitness function. Particles will move to areas where the MSE is lowest, and then you have your weights for the network.
For an online vanilla back-propagation network, my understanding of the general training procedure is:
for each epoch:
    for each training example d:
        feed-forward d through layers 0..n
        find error e as a function of expected vs. actual output
        back-propagate e through layers n..0
        update weights w as a function of w, e, learning and momentum rates
    endfor
endfor
I just can't find much info on using PSO to train neural networks or where it fits into the algorithm. Beyond my threadbare (and perhaps incorrect) assumption, I don't know if it's meant for online or batch learning, how the error is found for inner layers without BP, whether PSO replaces or accompanies BP, etc.
I'd love a push in the right direction but not necessarily code as I'm more interested in learning about it first before implementation.
Just for posterity and in case someone else comes across this question: PSO is integrated into neural networks by replacing BP for training. Using the MSE error function along with a set of training examples, you have a continuous and bounded search space and a fitness function, exactly what PSO needs.
initialize a set of random particles in n dimensions (n = number of weights in the network)
perform PSO using the swarm of particles
the PSO fitness function is the network's MSE
the MSE function (always?) uses a feed-forward pass to compute the sum of errors between predicted and target outputs
over time, the particles (each an encoding of the weights) will find a minimum of the MSE
return the best particle after some number of iterations and initialize the network weights to its position
There are other applications where you can use PSO in conjunction with neural networks, such as hyperparameter or model structure selection. I was most interested in training, however.
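To make this concrete, here is a rough numpy sketch of PSO training a tiny one-hidden-layer regression network; the particle count, coefficients, and the forward/pso_train helper names are all illustrative choices, not a reference implementation.
    import numpy as np

    def forward(weights, X, n_in, n_hidden):
        # unpack a flat particle position into a 1-hidden-layer network and run it
        w1_end = n_in * n_hidden
        W1 = weights[:w1_end].reshape(n_in, n_hidden)
        b1 = weights[w1_end:w1_end + n_hidden]
        W2 = weights[w1_end + n_hidden:w1_end + 2 * n_hidden]
        b2 = weights[-1]
        h = np.tanh(X @ W1 + b1)
        return h @ W2 + b2

    def mse_fitness(weights, X, y, n_in, n_hidden):
        # the PSO fitness is just the network's MSE on the training set
        return np.mean((forward(weights, X, n_in, n_hidden) - y) ** 2)

    def pso_train(X, y, n_in, n_hidden, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
        dim = n_in * n_hidden + n_hidden + n_hidden + 1   # total number of weights
        pos = np.random.randn(n_particles, dim) * 0.5
        vel = np.zeros_like(pos)
        pbest = pos.copy()
        pbest_fit = np.array([mse_fitness(p, X, y, n_in, n_hidden) for p in pos])
        gbest = pbest[np.argmin(pbest_fit)].copy()
        for _ in range(iters):
            r1, r2 = np.random.rand(n_particles, dim), np.random.rand(n_particles, dim)
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = pos + vel
            fit = np.array([mse_fitness(p, X, y, n_in, n_hidden) for p in pos])
            better = fit < pbest_fit
            pbest[better], pbest_fit[better] = pos[better], fit[better]
            gbest = pbest[np.argmin(pbest_fit)].copy()
        return gbest

    # toy usage: fit y = sin(x) on a small grid
    X = np.linspace(-3, 3, 100).reshape(-1, 1)
    y = np.sin(X).ravel()
    best = pso_train(X, y, n_in=1, n_hidden=8)
    print(mse_fitness(best, X, y, 1, 8))
In practice you would also bound velocities and positions and track a validation error, but this shows where PSO plugs in: it searches weight space directly, with the MSE as the fitness.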

Convolutional Neural Network with Logarithmic and Exponential activation functions

I am trying to implement a very specific convolutional neural network using Keras.
The key difference is that I have to use unusual activation functions: log and exp.
The basic structure of the ConvNet is as follows:
Input => Conv2D => Activation Log => Avg Pooling => Activation Exp => ...
The problem is: as the weights get too small, the log activation rapidly reaches -inf.
The reason why I need to use log and exp is that for certain layers I want to simulate what would be a *product pooling*, that is, the product over a smaller window (filter) of the current layer.
If I apply log(a) and log(b), I can do a normal average pooling ~ log(a)+log(b) followed by an exp activation which corresponds to that product I want: a * b = exp( log(a)+log(b) ).
To get rid of the -inf I have tried to train the network with SGD and a lower learning rate, so I could get larger weights, but it didn't work.
Please, do you have any idea of how I could avoid the -inf due to very small weights, or a smarter way of getting the product pooling without needing a log activation function?
Thank you.
The fact that your activation output is rapidly reaching -inf tells me that the output of your convolution layer is quite small! I cannot provide you with a definite answer to how to avoid the -inf, for it depends on the structure of your network, the range of values of your input, and the task the network is trained to do.
However, I'd suggest two possible solutions:
You may need to reconsider your weight initialization technique. Many practical CNNs initialize from a Gaussian distribution with zero mean and relatively small variance, ~ 1x10^-3, and that will be an issue in your network because the majority of weights are very small and will drive the convolution output below one.
Consider lower-bounding the values going into the log activation, e.g. clipping with something like max{x, ε} for a small ε.
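If it helps, here is a minimal Keras sketch of the log => average pooling => exp structure with such a clip in front of the log; the layer sizes and the epsilon value are illustrative assumptions. Note that average pooling of the logs gives the geometric mean of each window (the product raised to the power 1/window-size) rather than the raw product.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    eps = 1e-6   # illustrative floor applied before the log to keep it finite

    model = models.Sequential([
        layers.Input(shape=(32, 32, 1)),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        # log -> average pooling -> exp: a "product-like" (geometric mean) pooling
        layers.Lambda(lambda t: tf.math.log(tf.maximum(t, eps))),
        layers.AveragePooling2D(pool_size=2),
        layers.Lambda(tf.math.exp),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ])
    model.summary()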

What's the relationship between an SVM and hinge loss?

My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? How do the support vectors come into play? What about the slack variables? Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
I will answer one thing at a time.
Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?
SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.
Or is it more complex than that?
No, it is "just" that, however there are different ways of looking at this model leading to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for log loss (logistic regression) nor mse (linear regression). Furthermore you can show very important theoretical properties, such as those related to Vapnik-Chervonenkis dimension reduction leading to smaller chance of overfitting.
Intuitively, look at these three common losses, written in margin form with y in {-1, +1} and p the predicted score:
hinge: max(0, 1 - py)
log: log(1 + exp(-py))
mse: (p - y)^2
Only the first one has the property that once something is classified correctly with enough margin, it incurs 0 penalty. All the remaining ones still penalize your linear model even if it classifies samples correctly. Why? Because they are more related to regression than to classification: they want a perfect prediction, not just a correct one.
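A tiny numeric check of that point, using the margin-form losses above (the numbers are just an example):
    import numpy as np

    p, y = 2.0, 1.0                        # a correctly classified point with margin 2

    hinge = max(0.0, 1.0 - p * y)          # 0.0: no penalty once the margin exceeds 1
    log_loss = np.log1p(np.exp(-p * y))    # ~0.13: still penalizes a correct prediction
    mse = (p - y) ** 2                     # 1.0: wants p to equal y exactly
    print(hinge, log_loss, mse)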
How do the support vectors come into play?
Support vectors are simply samples placed near the decision boundary (loosely speaking). For the linear case this does not change much, but as most of the power of the SVM lies in its kernelization, there the SVs are extremely important. Once you introduce a kernel, thanks to the hinge loss the SVM solution can be obtained efficiently, and the support vectors are the only samples remembered from the training set, thus building a non-linear decision boundary with a subset of the training data.
What about the slack variables?
This is just another formulation of the hinge loss, more useful when you want to kernelize the solution and show convexity.
Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
You can; however, as the SVM is not a probabilistic model, its training might be a bit tricky. Furthermore, the whole strength of the SVM comes from its efficiency and global solution, both of which would be lost once you create a deep network. Such models do exist, though; in particular, an SVM (with squared hinge loss) is nowadays often the choice for the topmost layer of deep networks, so the whole optimization is actually a deep SVM. Adding more layers in between has nothing to do with the SVM or its cost; those layers are defined completely by their activations, and you could for example use an RBF activation function, but it has been shown numerous times that this leads to weak models (too local features are detected).
To sum up:
there are deep SVMs; this is simply a typical deep neural network with an SVM layer on top.
there is no such thing as putting an SVM layer "in the middle", as the training criterion is only applied to the output of the network.
using "typical" SVM kernels as activation functions is not popular in deep networks due to their locality (as opposed to the very global ReLU or sigmoid).

Graphically, how does the non-linear activation function project the input onto the classification space?

I am having a very hard time visualizing how the activation function actually manages to classify non-linearly separable training data sets.
Why does the activation function (e.g. the tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to the output? What separates training samples of different classes, and how would this process look if one had to plot it graphically?
I've looked through numerous sources, but I just cannot easily grasp what exactly makes the activation function work for classifying training samples in a neural network, and I would like to be able to picture this in my mind.
The mathematical result behind neural networks is the Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth, almost-piecewise-constant approximators. The more neurons you have, the better your approximation is.
This picture was taken from this article: A visual proof that neural nets can compute any function. Make sure to check that article; it has other examples and interactive applets.
NNs actually create new features at each layer by distorting the input space. Non-linear functions allow you to change the "curvature" of the target function, so further layers have a chance to make it linearly separable. If there were no non-linear functions, any composition of linear functions would still be linear, so there would be no benefit from multiple layers. As a graphical example, consider this animation.
These pictures were taken from this article. Also check out that cool visualization applet.
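A quick numeric illustration of the "stacked linear layers stay linear" point, with random matrices standing in for trained weights:
    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))
    x = rng.normal(size=(5, 4))

    # two linear layers collapse into a single linear map
    print(np.allclose((x @ W1) @ W2, x @ (W1 @ W2)))          # True: no extra expressive power

    # inserting a non-linearity between them breaks the collapse
    print(np.allclose(np.tanh(x @ W1) @ W2, x @ (W1 @ W2)))   # False: the map is now non-linear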
Activation functions have very little to do with classifying non-linearly separable sets of data.
Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
You can think of this as a way of enforcing equality across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it's not as simple as regular feature scaling (as in linear regression), and thus activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.
Back-propagation is a process that applies error to the neural network. You can think of this as a positive reward for the neurons that contributed to the correct classification and a negative reward for the neurons that contributed to an incorrect classification. This contribution is often known as the gradient of the neural network. The gradient is, effectively, a multi-variable derivative.
When back-propagating the error, each individual neuron's contribution to the gradient involves the activation function's derivative at that neuron's input value. Sigmoid is a particularly convenient function because its derivative is extremely cheap to compute: specifically, s'(x) = s(x)(1 - s(x)).
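A quick numerical check of that identity, comparing the analytic derivative with a finite difference:
    import numpy as np

    def s(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-5, 5, 11)
    analytic = s(x) * (1.0 - s(x))                   # s'(x) = s(x) (1 - s(x))
    numeric = (s(x + 1e-6) - s(x - 1e-6)) / 2e-6     # central finite difference
    print(np.allclose(analytic, numeric, atol=1e-6)) # True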
Here is an example image (found by Google image searching "neural network classification") that demonstrates how a neural network's decision regions might be superimposed on top of your data set.
I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.

Can neural networks approximate any function given enough hidden neurons?

I understand that neural networks with any number of hidden layers can approximate nonlinear functions; however, can they approximate:
f(x) = x^2
I can't think of how they could. It seems like a very obvious limitation of neural networks that can potentially limit what they can do. For example, because of this limitation, neural networks probably can't properly approximate many functions used in statistics, like the exponential moving average, or even the variance.
Speaking of moving averages, can recurrent neural networks properly approximate that? I understand how a feedforward neural network, or even a single linear neuron, can output a moving average using the sliding-window technique, but how would a recurrent neural network do it without X hidden layers (X being the moving-average window size)?
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that? We would first need to know how many timesteps it should have, which we don't. Perhaps an LSTM network could, but even then, what if it's not a simple moving average but an exponential moving average? I don't think even an LSTM can do it.
Even worse still, what if f(x,x1) that we are trying to learn is simply
f(x,x1) = x * x1
That seems very simple and straightforward. Can a neural network learn it? I don't see how.
Am I missing something huge here or are machine learning algorithms extremely limited? Are there other learning techniques besides neural networks that can actually do any of this?
The key point to understand is compactness:
Neural networks (like any other approximation structure, e.g. polynomials, splines, or radial basis functions) can approximate any continuous function only within a compact set.
In other words, the theory states that, given:
A continuous function f(x),
A finite range for the input x, [a,b], and
A desired approximation accuracy ε>0,
then there exists a neural network that approximates f(x) with an approximation error less than ε, everywhere within [a,b].
Regarding your example of f(x) = x^2, yes, you can approximate it with a neural network within any finite range: [-1,1], [0, 1000], etc. To visualise this, imagine approximating f(x) within [-1,1] with a step function. Can you do it on paper? Note that if you make the steps narrow enough, you can achieve any desired accuracy. The way neural networks approximate f(x) is not much different from this.
But again, there is no neural network (or any other approximation structure) with a finite number of parameters that can approximate f(x) = x^2 for all x in [-∞, +∞].
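A small Keras sketch of this point (network size, optimizer, and epoch count are arbitrary choices): the fit is good inside the training range [-1, 1] and degrades quickly outside it.
    import numpy as np
    from tensorflow.keras import layers, models

    x_train = np.linspace(-1, 1, 500).reshape(-1, 1)    # compact training range
    y_train = x_train ** 2

    model = models.Sequential([
        layers.Input(shape=(1,)),
        layers.Dense(32, activation="tanh"),
        layers.Dense(32, activation="tanh"),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x_train, y_train, epochs=200, verbose=0)

    x_test = np.array([[0.5], [2.0], [5.0]])
    print(model.predict(x_test, verbose=0).ravel())     # typically close to 0.25 at x=0.5,
    print((x_test ** 2).ravel())                        # but nowhere near 4 or 25 outside [-1, 1]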
The question is very legitimate and unfortunately many of the answers show how little practitioners seem to know about the theory of neural networks. The only rigorous theorem that exists about the ability of neural networks to approximate different kinds of functions is the Universal Approximation Theorem.
The UAT states that any continuous function on a compact domain can be approximated by a neural network with only one hidden layer provided the activation functions used are BOUNDED, continuous and monotonically increasing. Now, a finite sum of bounded functions is bounded by definition.
A polynomial is not bounded so the best we can do is provide a neural network approximation of that polynomial over a compact subset of R^n. Outside of this compact subset, the approximation will fail miserably as the polynomial will grow without bound. In other words, the neural network will work well on the training set but will not generalize!
The question is neither off-topic nor does it represent the OP's opinion.
I am not sure why there is such a visceral reaction; I think it is a legitimate question whose answer is hard to find by googling, even though I think the result is widely appreciated and repeated out loud. I think in this case you are looking for the actual citations showing that a neural net can approximate any function. This recent paper explains it nicely, in my opinion. They also cite the original paper by Barron from 1993 that proved a less general result. The conclusion: a two-layer neural network can represent any bounded-degree polynomial, under certain (seemingly non-restrictive) conditions.
Just in case the link does not work, it is called "Learning Polynomials with Neural Networks" by Andoni et al., 2014.
I understand that neural networks with any number of hidden layers can approximate nonlinear functions; however, can they approximate:
f(x) = x^2
The only way I can make sense of that question is that you're talking about extrapolation. So, e.g., given training samples in the range -1 < x < +1, can a neural network learn the right values for x > 100? Is that what you mean?
If you had prior knowledge that the functions you're trying to approximate are likely to be low-order polynomials (or any other set of functions), then you could surely build a neural network that can represent these functions and extrapolate x^2 everywhere.
If you don't have prior knowledge, things are a bit more difficult: There are infinitely many smooth functions that fit x^2 in the range -1..+1 perfectly, and there's no good reason why we would expect x^2 to give better predictions than any other function. In other words: If we had no prior knowledge about the function we're trying to learn, why would we want to learn x -> x^2? In the realm of artificial training sets, x^2 might be a likely function, but in the real world, it probably isn't.
To give an example: Let's say the temperature on Monday (t=0) is 0°, on Tuesday it's 1°, on Wednesday it's 4°. We have no reason to believe temperatures behave like low-order polynomials, so we wouldn't want to infer from that data that the temperature next Monday will probably be around 49°.
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that?
I think that's two questions: First, can a neural network represent that function? I.e. is there a set of weights that would give exactly that behavior? It obviously depends on the network architecture, but I think we can come up with architectures that can represent (or at least closely approximate) this kind of function.
Question two: Can it learn this function, given enough training samples? Well, if your learning algorithm doesn't get stuck in a local minimum, sure: if you have enough training samples, any set of weights that doesn't approximate your function gives a training error greater than 0, while a set of weights that fits the function you're trying to learn has a training error of 0. So if you find a global optimum, the network must fit the function.
A network can learn x|->x * x if it has a neuron that calculates x * x, or, more generally, a node that calculates x**p and learns p. Such nodes aren't commonly used, but the statement that "no neural network can learn..." is too strong.
A network with ReLUs and a linear output layer can learn x|->2*x, even on an unbounded range of x values. The error will be unbounded, but the proportional error will be bounded. Any function learnt by such a network is piecewise linear, and in particular asymptotically linear.
However, there is a risk with ReLUs: once a ReLU is off for all training examples it ceases learning. With a large domain, it will turn on for some possible test examples and give an erroneous result. So ReLUs are only a good choice if test cases are likely to be within the convex hull of the training set. This is easier to guarantee if the dimensionality is low. One workaround is to prefer LeakyReLU.
One other issue: how many neurons do you need to achieve the approximation you want? Each ReLU or LeakyReLU implements a single change of gradient, so the number needed depends on the maximum absolute value of the second derivative of the objective function, divided by the maximum error to be tolerated.
There are theoretical limitations of neural networks. No neural network can ever learn the function f(x) = x*x.
Nor can it learn an infinite number of other functions, unless you assume the impractical:
1- an infinite number of training examples
2- an infinite number of units
3- an infinite amount of time to converge
NNs are good at learning low-level pattern-recognition problems (signals that in the end have some statistical pattern that can be represented by some "continuous" function!), but that's it!
No more!
Here's a hint:
Try to build an NN that takes n+1 data inputs (x0, x1, x2, ..., xn) and returns true (or 1) if (2 * x0) appears in the rest of the sequence. And, good luck.
Infinite functions, especially those that are recursive, cannot be learned. They just are!

Resources