Why are weights and biases necessary in neural networks?

So I've been trying to pick up neural networks over the vacation, and I've gone through a lot of pages about this. One thing I don't understand is: why do we need weights and biases?
For weights, my intuition is that we multiply the inputs by certain constants so that we can reach the value of y and learn the relation, kind of like y = mx + c. Please help me out with the intuition as well if possible. Thanks in advance :)

I'd like to credit this answer to Jed Fox from this site, whose explanation I have adapted. It's a great intro to neural networks:
https://github.com/cazala/synaptic/wiki/Neural-Networks-101
Adapted answer:
Neurons in a network are based on neurons found in nature. They take information in and, according to that information, elicit a certain response: an "activation".
Artificial neurons look like this:
[Figure: an artificial neuron ("Neuron J") with several weighted inputs and a single output]
As you can see, they have several inputs, and for each input there is a weight (the weight of that specific connection). When the artificial neuron activates, it computes its state by adding together all the incoming inputs, each multiplied by its connection weight. But neurons always have one extra input, the bias, whose value is always 1 and which has its own connection weight. This makes sure that even when all the inputs are zero, there can still be an activation in the neuron.
After computing its state, the neuron passes it through its activation function, which normalises the result (typically to a value between 0 and 1).
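As a minimal sketch of that computation (the names here are illustrative, not from any particular library), a single sigmoid neuron in Python might look like this:

import numpy as np

def neuron_output(inputs, weights, bias_weight):
    # Weighted sum of the inputs, plus a bias input that is always 1
    # with its own weight, as described above.
    state = np.dot(inputs, weights) + 1.0 * bias_weight
    return 1.0 / (1.0 + np.exp(-state))  # sigmoid squashes the state into (0, 1)

# Even when every input is 0, the bias still shifts the activation:
x = np.array([0.0, 0.0, 0.0])
w = np.array([0.5, -0.3, 0.8])
print(neuron_output(x, w, bias_weight=2.0))  # ~0.88 instead of a fixed 0.5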
These weights (and sometimes biases) are what we learn in a neural network. Think of them as the parameters of the system. Without them, the network would be pretty useless!
Additional comment:
In a network, those weighted inputs may come from other neurons, so you can begin to see that the weights also describe how neurons relate to each other, often signifying the importance of the relationship between two neurons.
I hope this helps. There is plenty more information available on the internet and in the link above. Consider reading through some of Stanford's material on CNNs for information on more complicated neural networks.

Related

Where do neurons in a neural network share their predictive results (learned functions)?

Definitely a noob NN question, but here it is:
I understand that neurons in a layer of a neural network all initialize with different (essentially random) input-feature weights as a means to vary their back-propagation results, so that they can converge to different functions describing the input data. However, I do not understand when or how these neurons, each generating a unique function to describe the input data, "communicate" their results with each other, as is done in ensemble ML methods (e.g. by growing a forest of trees with randomized initial decision criteria and then determining the most discriminative models in the forest). In the tree-ensemble example, all of the trees work together to generalize the rules each model learns.
How, where, and when do neurons communicate their prediction functions? I know individual neurons use gradient descent to converge to their respective functions, but those functions are unique since the neurons started with unique weights. How do they communicate these differences? I imagine there is some subtle behavior in combining the neurons' results in the output layer where this communication occurs. Also, is this communication part of the iterative training process?
Someone in the comments section (https://datascience.stackexchange.com/questions/14028/what-is-the-purpose-of-multiple-neurons-in-a-hidden-layer) asked a similar question, but I didn't see it answered.
Help would be greatly appreciated!
During forward propagation, each neuron typically participates in forming the value of multiple neurons in the next layer. During back-propagation, each of those next-layer neurons will try to push the participating neurons' weights around in order to minimise the error. That's pretty much it.
For example, let's say you're trying to get a NN to recognise digits. Say one neuron in a hidden layer starts getting ideas about recognising vertical lines, another starts finding horizontal lines, and so on. The next-layer neuron that is responsible for finding 1 will see that if it wants to be more accurate, it should pay lots of attention to the vertical-line guy; and also, the more the horizontal-line guy yells, the more it's not a 1. That's what weights are: telling each neuron how strongly it should care about each of its inputs. In turn, the vertical-line guy will learn how to recognise vertical lines better, by adjusting the weights for its input layer (e.g. individual pixels).
(This is quite abstract, though. No-one told the vertical-line guy that he should be recognising vertical lines. Different neurons just train for different things, and by virtue of the mathematics involved, they end up picking different features. One of them may or may not end up being vertical lines.)
There is no "communication" between neurons on the same layer (in the base case, where layers flow linearly from one to the next). It's all about neurons on one layer getting better at predicting features that the next layer finds useful.
At the output layer, the 1 guy might be saying "I'm 72% certain it's a 1", while the 7 guy might be saying "I give that 7 a B+", and the third one might be saying "A horrible 3, wouldn't look at it twice". We usually either take the word of whoever is loudest, or we normalise the output layer (divide each output by the sum of all outputs) so that we have actual comparable probabilities. However, this normalisation is not actually part of the neural network itself.
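To make both strategies concrete, here is a toy sketch (the numbers are invented for illustration):

import numpy as np

raw_outputs = np.array([0.72, 0.83, 0.05])  # the "1 guy", the "7 guy", the "3 guy"

# Option 1: take the word of whoever is loudest.
winner = int(np.argmax(raw_outputs))        # index 1, i.e. the "7" neuron

# Option 2: normalise so the scores become comparable probabilities.
probabilities = raw_outputs / raw_outputs.sum()
print(winner, probabilities.round(3))       # 1 [0.45  0.519 0.031]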

Large weights in NN

My NN when training seems to produce weights that are very large. Is it okay if weights in a neural network have values greater than 1 or less than -1?
Furthermore, the extremely large weights tend to be in the connections between the input and first hidden layer. For example, the input-hidden layer weights look something like this:
-12.728901995585,-13.2337212413569,5.73922593605989,-5.12803672380726......
Whereas the hidden-output weights look more like:
-0.00434217225630834,0.130458439630824,0.153923956195796,0.59407334088441
The NN functions fine, but the large weights are concerning, as I usually see them between -1 and 1. Are larger weights okay?
Thanks.
Judging by this step-by-step example, https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/, your NN's weights between the input and hidden layers do seem unusual (assuming a feed-forward NN). Towards the end of that example you can see how small the changes generated through back-propagation are.
But depending on the training iteration count and the NN architecture, it is possible to have large weights too. Also, a larger weight means the preceding neuron has a stronger effect on the following neuron, and hence on the NN's decision.
In general a NN's interior values are difficult to interpret. If your NN does the job, I guess it's OK to have large weights between some layers.
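To see why weights outside [-1, 1] can be perfectly legitimate, here is a toy sketch (not from the linked example): a single linear neuron fitting y = 12x has no choice but to learn a weight of about 12, purely because of how the data is scaled.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 12.0 * x                       # the target relation forces a large weight

w, lr = 0.0, 0.1
for _ in range(200):               # plain gradient descent on squared error
    grad = np.mean(2 * (w * x - y) * x)
    w -= lr * grad
print(w)                           # converges to roughly 12, far outside [-1, 1]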
I am unfortunately not very proficient on the subject, so corrections and explanations are welcome. I am looking forward to learning from the experts.

Is the bias node necessary in very large neural networks?

I understand the role of the bias node in neural nets, and why it is important for shifting the activation function in small networks. My question is this: is the bias still important in very large networks (more specifically, a convolutional neural network for image recognition using the ReLU activation function, 3 convolutional layers, 2 hidden layers, and over 100,000 connections), or does its effect get lost in the sheer number of activations occurring?
The reason I ask is that in the past I have built networks in which I forgot to implement a bias node, yet upon adding one saw a negligible difference in performance. Could this have been down to chance, in that the specific data set did not require a bias? Do I need to initialise the bias with a larger value in large networks? Any other advice would be much appreciated.
The bias node/term is there only to ensure the predicted output will be unbiased. If your input has a dynamic range that goes from -1 to +1 and your output is simply a translation of the input by +3, a neural net with a bias term will simply have a non-zero weight on the bias neuron while the others will be zero. If you do not have a bias neuron in that situation, all the activation functions and weights will be optimized so as to mimic, at best, a simple addition, using sigmoids/tangents and multiplication.
If both your inputs and outputs have the same range, say from -1 to +1, then the bias term will probably not be useful.
You could have a look at the weight of the bias node in the experiment you mention. Either it is very low, which probably means the inputs and outputs are centered already; or it is significant, in which case I would bet that the variance of the other weights is reduced, leading to a more stable (and less prone to overfitting) neural net.
Bias is equivalent to adding a constant like 1 to the input of every layer. Then the weight to that constant is equivalent to your bias. It's really simple to add.
Theoretically it isn't necessary, since the network can "learn" to create its own bias node on every layer: one of the neurons can set its weights very high so its output is always 1, or at 0 so it always outputs a constant 0.5 (for sigmoid units). This requires at least 2 layers, though.
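A minimal sketch of that trick (variable names are illustrative): append a constant 1 to each layer's input, and the extra column of weights acts as the bias.

import numpy as np

def layer_forward(x, W):
    # The last column of W plays the role of the bias vector, because
    # we append a constant 1 to the input before multiplying.
    x_with_bias = np.append(x, 1.0)
    return 1.0 / (1.0 + np.exp(-(W @ x_with_bias)))  # sigmoid units

x = np.array([0.2, -0.7])
W = np.array([[0.5, -0.3, 1.5],    # last column = bias weights
              [0.8,  0.1, -2.0]])
print(layer_forward(x, W))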

What is a Recurrent Neural Network, what is a Long Short Term Memory (LSTM) network, and is it always better? [closed]

First, let me apologize for cramming three questions into that title; I'm not sure what better way there is.
I'll get right to it. I think I understand feedforward neural networks pretty well.
But LSTMs really escape me, and I feel it may be because I don't have a very good grasp of recurrent neural networks in general. I have gone through Hinton's and Andrew Ng's courses on Coursera, and a lot of it still doesn't make sense to me.
From what I understood, recurrent neural networks differ from feedforward neural networks in that past values influence the next prediction. Recurrent neural networks are generally used for sequences.
The example I saw of recurrent neural network was binary addition.
010
+ 011
A recurrent neural network would take the rightmost 0 and 1 first and output a 1. Then take the 1 and 1 next, output a zero, and carry the 1. Then take the next 0 and 0, and output a 1 because of the carry from the last calculation. Where does it store this 1? In feedforward networks the result is basically:
y = a(w*x + b)
where w = weights of connections to previous layer
and x = activation values of previous layer or inputs
How is a recurrent neural network calculated? I am probably wrong, but from what I understood, recurrent neural networks are pretty much feedforward neural networks with T hidden layers, T being the number of timesteps. Each hidden layer takes the input X at its timestep, and its outputs are then added to the next hidden layer's inputs:
a(t) = f(w*x(t) + b + pa)
where t = current timestep
and x(t) = input at the current timestep
and w = weights of connections to the input layer
and pa = past activation values of the hidden layer
such that neuron i at timestep t uses the output value of neuron i at timestep t-1
y = o(v*a(T) + b)
where v = weights of connections from the last hidden layer to the output
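In code, the update I have in mind might look like this sketch (an Elman-style recurrent step; the weight names are mine):

import numpy as np

def rnn_forward(xs, W_in, W_rec, b, W_out, b_out):
    # h plays the role of pa above: the past activation of the hidden
    # layer, fed back in at every timestep.
    h = np.zeros(W_rec.shape[0])
    for x in xs:                               # one iteration per timestep
        h = np.tanh(W_in @ x + W_rec @ h + b)  # a(t) = f(w*x(t) + b + pa)
    return W_out @ h + b_out                   # y = o(v*a(T) + b), linear here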
But even if I understood this correctly, I don't see the advantage of doing this over simply using past values as inputs to a normal feedforward network (a sliding window, or whatever it's called).
For example, what is the advantage of using a recurrent neural network for binary addition rather than training a feedforward network with two output neurons, one for the binary result and the other for the carry, and then feeding the carry output back into the network?
However, I'm not sure how this is different from simply having past values as inputs in a feedforward model.
It seems to me that the more timesteps there are, the more of a disadvantage recurrent neural networks are compared to feedforward networks, because of the vanishing gradient. Which brings me to my second question: from what I understood, LSTM is a solution to the problem of the vanishing gradient. But I have no actual grasp of how it works. Furthermore, is it simply better than a recurrent neural network, or are there sacrifices to using an LSTM?
What is a Recurrent neural network?
The basic idea is that recurrent networks have loops. These loops allow the network to use information from previous passes, which acts as memory. The length of this memory depends on a number of factors but it is important to note that it is not indefinite. You can think of the memory as degrading, with older information being less and less usable.
For example, let's say we just want the network to do one thing: remember whether an input from earlier was a 1 or a 0. It's not difficult to imagine a network which just continually passes the 1 around in a loop. However, every time you send in a 0, the value going around the loop gets a little lower (this is a simplification, but it displays the idea). After some number of passes the loop value will be arbitrarily low, making the output of the network 0. As you are aware, the vanishing gradient problem is essentially the same, but in reverse.
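A toy numeric sketch of that degrading loop (the numbers are purely illustrative):

h = 1.0               # the loop "remembers" a 1
leak = 0.9            # a recurrent weight below 1, so the loop leaks
for step in range(1, 11):
    h = leak * h      # each pass with a 0 input lowers the looped value
    print(step, round(h, 3))
# after enough passes h is arbitrarily close to 0: the 1 has been forgotten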
Why not just use a window of time inputs?
You offer an alternative: a sliding window of past inputs being provided as current inputs. That is not a bad idea, but consider this: while the RNN's memory may have eroded over time, with a window you always lose the entirety of your time information once it slides past your window. And while you would remove the vanishing gradient problem, you would have to multiply the number of weights in your network several times over. Having to train all those additional weights will hurt you just as badly as (if not worse than) the vanishing gradient.
What is an LSTM network?
You can think of LSTM as a special type of RNN. The difference is that LSTM is able to actively maintain self-connecting loops without them degrading. This is accomplished through a somewhat fancy activation involving an additional "memory" output for the self-looping connection. The network must then be trained to select what data gets put onto this bus. By training the network to explicitly select what to remember, we don't have to worry about new inputs destroying important information, and the vanishing gradient doesn't affect the information we decided to keep.
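For the curious, here is a bare-bones sketch of a single LSTM step (the standard formulation; the stacked weight layout is one common convention, not the only one), showing the gating that protects the memory:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # W, U, b hold the stacked parameters for the forget (f), input (i)
    # and output (o) gates plus the candidate values (g).
    z = W @ x + U @ h + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c = f * c + i * np.tanh(g)   # the memory cell: nothing decays unless f < 1
    h = o * np.tanh(c)           # what the rest of the network sees
    return h, c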
There are two main drawbacks:
It is more expensive to calculate the network output and to apply back-propagation. You simply have more math to do because of the more complex activation. However, this is not as important as the second point.
The explicit memory adds several more weights to each node, all of which must be trained. This increases the dimensionality of the problem, and potentially makes it harder to find an optimal solution.
Is it always better?
Which structure is better depends on a number of factors, like the number of nodes you need for your problem, the amount of available data, and how far back you want your network's memory to reach. If you only want the theoretical answer, however, I would say that given infinite data and computing speed, an LSTM is the better choice; one should not take this as practical advice, though.
A feed forward neural network has connections from layer n to layer n+1.
A recurrent neural network allows connections from layer n to layer n as well.
These loops allow the network to perform computations on data from previous cycles, which creates a network memory. The length of this memory depends on a number of factors and is an area of active research, but could be anywhere from tens to hundreds of time steps.
To make it a bit more clear, the carried 1 in your example is stored in the same way as the inputs: in a pattern of activation of a neural layer. It's just the recurrent (same layer) connections that allow the 1 to persist through time.
Obviously it would be infeasible to replicate every input stream for more than a few past time steps, and choosing which historical streams are important would be very difficult (and lead to reduced flexibility).
LSTM is a very different model, which I'm only familiar with through comparison to the PBWM model, but in that review LSTM was able to actively maintain neural representations indefinitely, so I believe it is more intended for explicit storage. RNNs are better suited to non-linear time-series learning than to storage. I don't know if there are drawbacks to using LSTM rather than RNNs.
Both RNNs and LSTMs can be sequence learners. RNNs suffer from the vanishing gradient problem, which causes them to have trouble remembering values of past inputs after more than roughly 10 timesteps (an RNN can remember previously seen inputs for only a few timesteps).
LSTM is designed to solve the vanishing gradient problem in RNNs. LSTM can bridge long time lags between inputs; in other words, it is able to remember inputs from up to around 1000 timesteps in the past (some papers even claim it can go further). This capability makes LSTM advantageous for learning long sequences with long time lags. Refer to Alex Graves' Ph.D. thesis, Supervised Sequence Labelling with Recurrent Neural Networks, for details. If you are new to LSTM, I recommend Colah's blog for a super simple and easy explanation.
However, recent advances also claim that, with careful initialization, a plain RNN can learn long sequences with performance comparable to LSTM: see A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.

Interpreting Neural Network output

For a classification problem, how is the output of the network usually determined?
Say there are three possible classes, each with a numerical identifier. Would a reasonable solution be to sum the outputs and take that sum as the overall output of the network, or would you take the average of the network's outputs?
There is plenty of information regarding ANN theory, but not much about application, so I apologise if this is a silly question.
For a multi-layer perceptron classifier with 3 classes, one typically constructs a network with 3 outputs and trains the network so that (1,0,0) is the target output for the first class, (0,1,0) for the second class, and (0,0,1) for the third class. For classifying a new observation, you typically select the output with the greatest value (e.g., (0.12, 0.56, 0.87) would be classified as class 3).
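A quick sketch of that scheme (the output values are invented for illustration):

import numpy as np

targets = np.eye(3)           # one-hot targets: (1,0,0), (0,1,0), (0,0,1)

network_output = np.array([0.12, 0.56, 0.87])
predicted = int(np.argmax(network_output)) + 1   # classes numbered 1..3
print(predicted)              # -> 3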
I mostly agree with bogatron, and further you will find many posts here advising on this kind of "multi-class classification" with neural networks.
Regarding your title, I would like to add a note on interpreting the output as a probability, since I struggled to find a theoretical foundation for this. In what follows I'll talk about a neural network with 3 neurons in the output layer, where the target output is 1 for the respective class.
Since the sum of all three target outputs is always 1 during training, the trained network will also tend to give feed-forward outputs with a sum of approximately one (so rather (0.12, 0.36, 0.52) than bogatron's example). You can then interpret these figures as the probability that the respective input belongs to class 1/2/3 (e.g. a probability of 0.52 that it belongs to class 3).
This holds only approximately when using the logistic function as the output activation (with tanh the outputs can even be negative, so the interpretation breaks down); a softmax output layer makes the outputs sum to one exactly.
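If you want outputs that sum to one by construction, a softmax output layer is the usual choice. A minimal sketch:

import numpy as np

def softmax(z):
    # Exponentiate and normalise: outputs are positive and sum to 1.
    e = np.exp(z - z.max())    # subtracting the max improves numerical stability
    return e / e.sum()

print(softmax(np.array([0.3, 1.1, 1.6])))  # approx. [0.145, 0.323, 0.532]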
More on this:
Posterior probability via neural networks: http://www-vis.lbl.gov/~romano/mlgroup/papers/neural-networks-survey.pdf
How to convert the output of an artificial neural network into probabilities?
