Is the bias node necessary in very large neural networks? - machine-learning

I understand the role of the bias node in neural nets, and why it is important for shifting the activation function in small networks. My question is this: is the bias still important in very large networks (more specifically, a convolutional neural network for image recognition using the ReLu activation function, 3 convolutional layers, 2 hidden layers, and over 100,000 connections), or does its affect get lost by the sheer number of activations occurring?
The reason I ask is because in the past I have built networks in which I have forgotten to implement a bias node, however upon adding one have seen a negligible difference in performance. Could this have been down to chance, in that the specifit data-set did not require a bias? Do I need to initialise the bias with a larger value in large networks? Any other advice would be much appreciated.

The bias node/term is there only to ensure the predicted output will be unbiased. If your input has a dynamic (range) that goes from -1 to +1 and your output is simply a translation of the input by +3, a neural net with a bias term will simply have the bias neuron with a non-zero weight while the others will be zero. If you do not have a bias neuron in that situation, all the activation functions and weigh will be optimized so as to mimic at best a simple addition, using sigmoids/tangents and multiplication.
If both your inputs and outputs have the same range, say from -1 to +1, then the bias term will probably not be useful.
You could have a look at the weigh of the bias node in the experiment you mention. Either it is very low, and it probably means the inputs and outputs are centered already. Or it is significant, and I would bet that the variance of the other weighs is reduced, leading to a more stable (and less prone to overfitting) neural net.

Bias is equivalent to adding a constant like 1 to the input of every layer. Then the weight to that constant is equivalent to your bias. It's really simple to add.
Theoretically it isn't necessary since the network can "learn" to create it's own bias node on every layer. One of the neurons can set it's weight very high so it's always 1, or at 0 so it always outputs a constant 0.5 (for sigmoid units.) This requires at least 2 layers though.

Related

Why are weights and biases necessary in Neural networks ?

So I've been trying to pick up neural networks over the vacation now, and I've gone through a lot of pages about this. Now one thing I don't understand is why do we need weights and biases ?
For weights I've had this intuition that we're trying to multiply certain constants to the inputs such that we could reach the value of y and know the relation, kinda' like y = mx + c. Please help me out with the intuition as well if possible. Thanks in advance :)
I'd like to credit this answer to Jed Fox from this site where I have adapted his explanation. It's a great intro to neural Networks!:
https://github.com/cazala/synaptic/wiki/Neural-Networks-101
Adapted answer:
Neurons in a network are based on neurons found in nature. They take information in and, according to that information, will illicit a certain response. An "activation".
Artificial neurons look like this:
Neuron J
Artificial Neuron
As you can see they have several inputs, for each input there's a weight (the weight of that specific connection). When the artificial neuron activates, it computes its state, by adding all the incoming inputs multiplied by its corresponding connection weight. But neurons always have one extra input, the bias, which is always 1, and has its own connection weight. This makes sure that even when all the inputs are none (all 0s) there's gonna be an activation in the neuron.
After computing its state, the neuron passes it through its activation function, which normalises the result (normally between 0-1).
These weights (and sometimes biases) are what we learn in a neural network. Think of them as a the parameters of the system. Without them they'd be pretty useless!
Additional comment:
In a network, those weighted inputs may come from other neurons so you can begin to see that the weights also describe how neurons related to each other, often signifying the importance of the relationship between 2 neurons.
I hope this helps. There is plenty more information available on the internet and the link above. Consider reading through some of Stanford's Material for CNNs for information om more complicated neural networks.

What does "sparse" mean in the context of neural nets?

I've seen "sparse" and "sparsity" used in a way that suggests it's something that improves a model's accuracy. For example:
I think the unsupervised phase might be not so important if some
sparse connections or neurons are used, such as rectifier units or
convolutional connection, and big training data is available.
From https://www.quora.com/When-does-unsupervised-pre-training-improve-classification-accuracy-for-a-deep-neural-network-When-does-it-not
What does "sparse" mean in this context?
TL;DR: Sparsity means most of the weights are 0. This can lead to an increase in space and time efficiency.
Detailed version: In general, neural networks are represented as tensors. Each layer of neurons is represented by a matrix. Each entry in the matrix can be thought of as representative of the connection between two neurons. In a simple neural network, like a classic feed-forward neural network, every neuron on a given layer is connected to every neuron on the subsequent layer. This means that each layer must have n2 connections represented, where n is the size of both of the layers. In large networks, this can take a lot of memory and time to propagate. Since different parts of a neural network often work on different subtasks, it can be unnecessary for every neuron to be connected to every neuron in the next layer. In fact, it might make sense for a neural network to have most pairs of neurons with a connection weight of 0. Training a neural network might result in these less significant connection weights adopting values very close to 0 but accuracy would not be significantly affected if the values were exactly 0.
A matrix in which most entries are 0 is called a sparse matrix. These matrices can be stored more efficiently and certain computations can be carried out more efficiently on them provided the matrix is sufficiently large and sparse. Neural networks can leverage the efficiency gained from sparsity by assuming most connection weights are equal to 0.
I must say that neural networks are a complex and diverse topic. There are a lot of approaches used. There are certain kinds of neural networks with different morphologies than the simple layer connections I referenced above. Sparsity can be leveraged in many types of neural networks since matrices are fairly universal to neural network representation.
Sparse, as can be deduced from the meaning in layman English refers to sparsity in the connections between neurons, basically, the weights have non-significant values (close to 0)
In some cases it might also refer to cases where we do not have all connections and very less connections itself (less weights)

How to propagate uncertainty into the prediction of a neural network?

I have inputs x_1, ..., x_n that have known 1-sigma uncertainties e_1, ..., e_n. I am using them to predict outputs y_1, ..., y_m on a trained neural network. How can I obtain 1-sigma uncertainties on my predictions?
My idea is to randomly perturb each input x_i with normal noise having mean 0 and standard deviation e_i a large number of times (say, 10000), and then take the median and standard deviation of each prediction y_i. Does this work?
I fear that this only takes into account the "random" error (from the measurements) and not the "systematic" error (from the network), i.e., each prediction inherently has some error to it that is not being considered in this approach. How can I properly obtain 1-sigma error bars on my predictions?
You can get a general analysis of what "jittering" (generation of random samples) brings to the neural network optimization here http://wojciechczarnecki.com/pdfs/preprint-ml-with-unc.pdf
In short - jittering is just a regularization on network's weights.
For errors bars as such you should refer to works of Will Penny
http://www.fil.ion.ucl.ac.uk/~wpenny/publications/error_bars.ps
http://www.fil.ion.ucl.ac.uk/~wpenny/publications/nnerrors.ps
u r right. That method only takes the data uncertainty into account (assuming u don't fit the neural net while applying the noise). As a side note, alternatively when fitting the data using a neural net u may also apply mixture density networks (see one of the many tutorials).
More importantly, in order to account for model uncertainty u should apply bayesian neural nets. U could could start e.g. with Monte-Carlo dropout. Also very interesting should be this work on performing sampling-free inference when using Monte-Carlo dropout
https://arxiv.org/abs/1908.00598
This work explicitly uses error propagation through neural networks and should be very interesting for u!
Best

Can neural networks approximate any function given enough hidden neurons?

I understand neural networks with any number of hidden layers can approximate nonlinear functions, however, can it approximate:
f(x) = x^2
I can't think of how it could. It seems like a very obvious limitation of neural networks that can potentially limit what it can do. For example, because of this limitation, neural networks probably can't properly approximate many functions used in statistics like Exponential Moving Average, or even variance.
Speaking of moving average, can recurrent neural networks properly approximate that? I understand how a feedforward neural network or even a single linear neuron can output a moving average using the sliding window technique, but how would recurrent neural networks do it without X amount of hidden layers (X being the moving average size)?
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that? We would first need to know how many timesteps it should have, which we don't. Perhaps a LSTM network could, but even then, what if it's not a simple moving average, it's an exponential moving average? I don't think even LSTM can do it.
Even worse still, what if f(x,x1) that we are trying to learn is simply
f(x,x1) = x * x1
That seems very simple and straightforward. Can a neural network learn it? I don't see how.
Am I missing something huge here or are machine learning algorithms extremely limited? Are there other learning techniques besides neural networks that can actually do any of this?
The key point to understand is compact:
Neural networks (as any other approximation structure like, polynomials, splines, or Radial Basis Functions) can approximate any continuous function only within a compact set.
In other words the theory states that, given:
A continuous function f(x),
A finite range for the input x, [a,b], and
A desired approximation accuracy ε>0,
then there exists a neural network that approximates f(x) with an approximation error less than ε, everywhere within [a,b].
Regarding your example of f(x) = x2, yes you can approximate it with a neural network within any finite range: [-1,1], [0, 1000], etc. To visualise this, imagine that you approximate f(x) within [-1,1] with a Step Function. Can you do it on paper? Note that if you make the steps narrow enough you can achieve any desired accuracy. The way neural networks approximate f(x) is not much different than this.
But again, there is no neural network (or any other approximation structure) with a finite number of parameters that can approximate f(x) = x2 for all x in [-∞, +∞].
The question is very legitimate and unfortunately many of the answers show how little practitioners seem to know about the theory of neural networks. The only rigorous theorem that exists about the ability of neural networks to approximate different kinds of functions is the Universal Approximation Theorem.
The UAT states that any continuous function on a compact domain can be approximated by a neural network with only one hidden layer provided the activation functions used are BOUNDED, continuous and monotonically increasing. Now, a finite sum of bounded functions is bounded by definition.
A polynomial is not bounded so the best we can do is provide a neural network approximation of that polynomial over a compact subset of R^n. Outside of this compact subset, the approximation will fail miserably as the polynomial will grow without bound. In other words, the neural network will work well on the training set but will not generalize!
The question is neither off-topic nor does it represent the OP's opinion.
I am not sure why there is such a visceral reaction, I think it is a legitimate question that is hard to find by googling it, even though I think it is widely appreciated and repeated outloud. I think in this case you are looking for the actually citations showing that a neural net can approximate any function. This recent paper explains it nicely, in my opinion. They also cite the original paper by Barron from 1993 that proved a less general result. The conclusion: a two-layer neural network can represent any bounded degree polynomial, under certain (seemingly non-restrictive) conditions.
Just in case the link does not work, it is called "Learning Polynomials with Neural Networks" by Andoni et al., 2014.
I understand neural networks with any number of hidden layers can approximate nonlinear functions, however, can it approximate:
f(x) = x^2
The only way I can make sense of that question is that you're talking about extrapolation. So e.g. given training samples in the range -1 < x < +1 can a neural network learn the right values for x > 100? Is that what you mean?
If you had prior knowledge, that the functions you're trying to approximate are likely to be low-order polynomials (or any other set of functions), then you could surely build a neural network that can represent these functions, and extrapolate x^2 everywhere.
If you don't have prior knowledge, things are a bit more difficult: There are infinitely many smooth functions that fit x^2 in the range -1..+1 perfectly, and there's no good reason why we would expect x^2 to give better predictions than any other function. In other words: If we had no prior knowledge about the function we're trying to learn, why would we want to learn x -> x^2? In the realm of artificial training sets, x^2 might be a likely function, but in the real world, it probably isn't.
To give an example: Let's say the temperature on Monday (t=0) is 0°, on Tuesday it's 1°, on Wednesday it's 4°. We have no reason to believe temperatures behave like low-order polynomials, so we wouldn't want to infer from that data that the temperature next Monday will probably be around 49°.
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that?
I think that's two questions: First, can a neural network represent that function? I.e. is there a set of weights that would give exactly that behavior? It obviously depends on the network architecture, but I think we can come up with architectures that can represent (or at least closely approximate) this kind of function.
Question two: Can it learn this function, given enough training samples? Well, if your learning algorithm doesn't get stuck in a local minimum, sure: If you have enough training samples, any set of weights that doesn't approximate your function gives a training error greater that 0, while a set of weights that fit the function you're trying to learn has a training error=0. So if you find a global optimum, the network must fit the function.
A network can learn x|->x * x if it has a neuron that calculates x * x. Or more generally, a node that calculates x**p and learns p. These aren't commonly used, but the statement that "no neural network can learn..." is too strong.
A network with ReLUs and a linear output layer can learn x|->2*x, even on an unbounded range of x values. The error will be unbounded, but the proportional error will be bounded. Any function learnt by such a network is piecewise linear, and in particular asymptotically linear.
However, there is a risk with ReLUs: once a ReLU is off for all training examples it ceases learning. With a large domain, it will turn on for some possible test examples, and give an erroneous result. So ReLUs are only a good choice if test cases are likely to be within the convex hull of the training set. This is easier to guarantee if the dimensionality is low. One work around is to prefer LeakyReLU.
One other issue: how many neurons do you need to achieve the approximation you want? Each ReLU or LeakyReLU implements a single change of gradient. So the number needed depends on the maximum absolute value of the second differential of the objective function, divided by the maximum error to be tolerated.
There are theoretical limitations of Neural Networks. No neural network can ever learn the function f(x) = x*x
Nor can it learn an infinite number of other functions, unless you assume the impractical:
1- an infinite number of training examples
2- an infinite number of units
3- an infinite amount of time to converge
NNs are good in learning low-level pattern recognition problems (signals that in the end have some statistical pattern that can be represented by some "continuous" function!), but that's it!
No more!
Here's a hint:
Try to build a NN that takes n+1 data inputs (x0, x1, x2, ... xn) and it will return true (or 1) if (2 * x0) is in the rest of the sequence. And, good luck.
Infinite functions especially those that are recursive cannot be learned. They just are!

Why use tanh for activation function of MLP?

Im personally studying theories of neural network and got some questions.
In many books and references, for activation function of hidden layer, hyper-tangent functions were used.
Books came up with really simple reason that linear combinations of tanh functions can describe nearly all shape of functions with given error.
But, there came a question.
Is this a real reason why tanh function is used?
If then, is it the only reason why tanh function is used?
if then, is tanh function the only function that can do that?
if not, what is the real reason?..
I stock here keep thinking... please help me out of this mental(?...) trap!
Most of time tanh is quickly converge than sigmoid and logistic function, and performs better accuracy [1]. However, recently rectified linear unit (ReLU) is proposed by Hinton [2] which shows ReLU train six times fast than tanh [3] to reach same training error. And you can refer to [4] to see what benefits ReLU provides.
Accordining to about 2 years machine learning experience. I want to share some stratrgies the most paper used and my experience about computer vision.
Normalizing input is very important
Normalizing well could get better performance and converge quickly. Most of time we will subtract mean value to make input mean to be zero to prevent weights change same directions so that converge slowly [5] .Recently google also points that phenomenon as internal covariate shift out when training deep learning, and they proposed batch normalization [6] so as to normalize each vector having zero mean and unit variance.
More data more accuracy
More training data could generize feature space well and prevent overfitting. In computer vision if training data is not enough, most of used skill to increase training dataset is data argumentation and synthesis training data.
Choosing a good activation function allows training better and efficiently.
ReLU nonlinear acitivation worked better and performed state-of-art results in deep learning and MLP. Moreover, it has some benefits e.g. simple to implementation and cheaper computation in back-propagation to efficiently train more deep neural net. However, ReLU will get zero gradient and do not train when the unit is zero active. Hence some modified ReLUs are proposed e.g. Leaky ReLU, and Noise ReLU, and most popular method is PReLU [7] proposed by Microsoft which generalized the traditional recitifed unit.
Others
choose large initial learning rate if it will not oscillate or diverge so as to find a better global minimum.
shuffling data
In truth both tanh and logistic functions can be used. The idea is that you can map any real number ( [-Inf, Inf] ) to a number between [-1 1] or [0 1] for the tanh and logistic respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now regarding the preference for the tanh over the logistic function is that the first is symmetric regarding the 0 while the second is not. This makes the second one more prone to saturation of the later layers, making training more difficult.
To add up to the the already existing answer, the preference for symmetry around 0 isn't just a matter of esthetics. An excellent text by LeCun et al "Efficient BackProp" shows in great details why it is a good idea that the input, output and hidden layers have mean values of 0 and standard deviation of 1.
Update in attempt to appease commenters: based purely on observation, rather than the theory that is covered above, Tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems to be more prone to local optima, or a least extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR and sigmoid rarely succeeds whereas Tanh and ReLU have more success.
Tanh seems maybe slower than ReLU for many of the given examples, but produces more natural looking fits for the data using only linear inputs, as you describe. For example a circle vs a square/hexagon thing.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters to neural network. Not a direct answer to your question but the tool 'provides intuition' as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (1 - e^2x) / (1 + e^2x)) is preferable to the sigmoid/logistic function (1 / (1 + e^-x)), but it should noted that there is a good reason why these are the two most common alternatives that should be understood, which is that during training of an MLP using the back propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this could generally be calculated for most plausible activation functions (except those with discontinuities, which is a bit of a problem for those), doing so often requires expensive computations and/or storing additional data (e.g. the value of input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient calculations for their derivatives that can be calculated from the output of the functions; i.e. if the node's weighted sum of inputs is v and its output is u, we need to know du/dv which can be calculated from u rather than the more traditional v: for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
In theory I in accord with above responses. In my experience, some problems have a preference for sigmoid rather than tanh, probably due to the nature of these problems (since there are non-linear effects, is difficult understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is choosen randonm between a set of possibilities (sigmoid, tanh, linear, ...). For a 30% of problems of classification, best element found by genetic algorithm has sigmoid as activation function.
In deep learning the ReLU has become the activation function of choice because the math is much simpler from sigmoid activation functions such as tanh or logit, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for hidden layers, meaning you need the derivative of the activation functions. ReLU is a ramp function where you have a flat part where the derivative is 0, and a skewed part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the fading gradient problem, meaning if x is smaller than -2 or bigger than 2, the derivative gets really small and your network might not converge, or you might end up having a dead neuron that does not fire anymore.

Resources