backpropagation with more than one node per layer - machine-learning

I read this article about how backpropagation works, and I understood everything it said. It explained that to find the gradient we have to take the partial derivative of the cost function with respect to each weight/bias. However, to explain this it used a network with one node per layer. How do you do backpropagation for a network with more than one node per layer?

I haven't checked the math I put forward here too thoroughly, so if anyone sees an error, please correct me!
Anyway, the image below is a very simple example of backpropagation. As you can see, we're interested in the gradients of the loss function L (here the loss function is extremely simple and not useful outside this example) with respect to the weights W, so that we can update the weights with the gradient descent optimizer (there are other, better optimizers, but gradient descent is the easiest one to understand, so I suggest you read up on it). The key to your understanding is the first equation in the box: you first apply the chain rule, and then sum up all the gradients this gives you.
For further understanding, I suggest that you write up all your equations for forward propagation, and then calculate the chain rule for dL/dW and dL/da at each layer. It might also be easier if you break the equations up further and set a = f(z), z = W * X (to make the chain rule more intuitive, i.e. dL/dW = dL/da * da/dz * dz/dW). There are also multiple guides out there that you can read for further understanding.
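To make the chain rule concrete for layers with more than one node, here is a minimal sketch in plain Python. The 2-2-1 architecture, sigmoid activations, squared-error loss, and all weight values are assumptions chosen purely for illustration. The multi-node step is the hidden-layer delta: each hidden node sums the error flowing back through every weight that leaves it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 2-2-1 network; all weights below are made-up example values.
W1 = [[0.5, -0.3], [0.8, 0.2]]   # W1[j][i]: input i -> hidden node j
W2 = [0.4, -0.6]                 # hidden node j -> output
x, y = [1.0, 0.5], 1.0           # one training example

def forward(W1, W2, x):
    z1 = [sum(W1[j][i] * x[i] for i in range(2)) for j in range(2)]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(W2[j] * a1[j] for j in range(2))
    return a1, sigmoid(z2)

a1, a2 = forward(W1, W2, x)
L = 0.5 * (a2 - y) ** 2           # squared-error loss

# Output delta: dL/da2 * da2/dz2.
delta2 = (a2 - y) * a2 * (1 - a2)
dW2 = [delta2 * a1[j] for j in range(2)]

# Hidden deltas: each hidden node collects the error arriving
# through its outgoing weight -- this is the multi-node step.
delta1 = [delta2 * W2[j] * a1[j] * (1 - a1[j]) for j in range(2)]
dW1 = [[delta1[j] * x[i] for i in range(2)] for j in range(2)]

# One gradient-descent update with learning rate 0.5.
lr = 0.5
W2 = [W2[j] - lr * dW2[j] for j in range(2)]
W1 = [[W1[j][i] - lr * dW1[j][i] for i in range(2)] for j in range(2)]

_, a2_new = forward(W1, W2, x)
L_new = 0.5 * (a2_new - y) ** 2   # loss after the update
```

With more hidden nodes or layers, the same two lines (delta, then outer product with the inputs) simply repeat per layer.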

Related

Can gradient descent itself solve non-linear problem in ANN?

I'm currently studying neural network theory, and I'm a little confused about the roles of gradient descent and the activation function in an ANN.
From what I understand, the activation function is what makes the model non-linear, so that it can solve problems that are not linearly separable, while gradient descent is the tool that helps the model learn.
So my questions are:
If I use an activation function such as sigmoid for the model, but instead of using gradient descent to improve the model I use the classic perceptron learning rule Wj = Wj + a*(y - h(x)), where h(x) is the sigmoid of the net input, can the model learn a non-linearly separable problem?
If I do not include a non-linear activation function in the model, just the simple net input h(x) = w0 + w1*x1 + ... + wj*xj, and use gradient descent to improve the model, can the model learn a non-linearly separable problem?
I'm really confused about which of the two is the main reason the model can learn non-linearly separable problems.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The architecture you use determines a family of functions h_w(x), and each value of w represents one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) is often infinitely large.
The goal of learning is then to determine which w is most appropriate. This is where gradient descent comes in: it is just an optimisation tool that helps you pick a reasonably good w, and thus select a particular model h(x).
The problem is, the actual f function you are trying to approximate may not be part of the family h_w you decided to pick, and so even the best w you can find will only give you an approximation of f.
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single-layer ANN will not help it learn a non-linear function. Indeed, a single-layer ANN is equivalent to linear regression, and adding the sigmoid transforms it into logistic regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash the output to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts as a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break, e.g., gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here: the family of functions you're optimising over consists only of functions that are linear in their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with simple linear regression. You don't even project your result back to [0,1], so the output will not be as simple to interpret as a class probability, but the final result will be the same: a linear decision boundary. See the Playground for a visual proof.
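To see concretely that the learning rule cannot save a linear model, here is a small self-contained check in Python (the weight grid is an arbitrary assumption): no decision rule of the form w1*x1 + w2*x2 + b > 0 classifies all four XOR points correctly, so the best any single-layer model can reach is 3 out of 4, no matter how it was trained.

```python
from itertools import product

# XOR truth table: ((x1, x2), label).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def accuracy(w1, w2, b):
    """Fraction of XOR points the linear rule w1*x1 + w2*x2 + b > 0 gets right."""
    hits = sum(((w1 * x1 + w2 * x2 + b) > 0) == bool(y) for (x1, x2), y in data)
    return hits / len(data)

# Sweep a coarse (arbitrary) grid of linear decision functions.
grid = [i / 4 for i in range(-8, 9)]   # -2.0 .. 2.0 in steps of 0.25
best = max(accuracy(w1, w2, b) for w1, w2, b in product(grid, grid, grid))
# No choice of w1, w2, b classifies all four points: best stays at 0.75.
```

A short pencil-and-paper argument shows the grid isn't hiding anything: classifying (0,1) and (1,0) as positive forces w1 + w2 + b > 0, contradicting the requirement that (1,1) be negative.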
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data, by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that a single hidden layer is enough to approximate anything, as long as you make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper requires fewer neurons overall, even though it makes training more complex.
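The first option, hand-crafting the representation, can be sketched in a few lines of Python. The circular toy data and the squared-feature map below are assumptions made for illustration, mirroring the Playground circle example:

```python
import random

random.seed(0)

# Toy data (assumed for illustration): label 1 iff the point lies
# inside the unit circle, a decision boundary no straight line can draw.
pts = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(200)]
labels = [1 if x1 ** 2 + x2 ** 2 < 1 else 0 for x1, x2 in pts]

# Preprocess (x1, x2) -> (x1^2, x2^2): in this feature space the circle
# becomes the straight line f1 + f2 = 1, so a LINEAR rule suffices.
def linear_rule(x1, x2):
    f1, f2 = x1 ** 2, x2 ** 2         # the hand-crafted features
    return 1 if f1 + f2 < 1 else 0    # weights (-1, -1), bias 1

acc = sum(linear_rule(x1, x2) == y for (x1, x2), y in zip(pts, labels)) / len(pts)
```

The model is still linear in its inputs; all the non-linearity was moved into the preprocessing step, which is exactly why this only works when you already know the shape of the boundary.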
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which those linear assumptions are relatively accurate.
Non-linear transformations give us a better separation with which to implement the ideas "this is boring" and "this is exactly what I'm looking for!" If these functions are smooth, or have only a few jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity: try the choices you've listed, and see which works best for your application.

Can a linear classifier represent logic gates requiring more than one operator?

For example, how do you get the 0/1 truth table that represents, say, a OR (b AND NOT c)? I understand the individual logic gates, but not how to put the pieces together. I've tried, but to no avail.
Check this example. You can break your expression into a few steps so that each step is easy to understand.
I'm not aware of a general method for finding a linear decision function. What you can do is train a linear classifier on the boolean truth table and look at the result.
A possible decision function for OR could be:
def or_gate(A, B):
    if A + B - 0.5 > 0:
        return 1
    else:
        return 0
Here is an interesting link related to your question:
It is possible to use neural networks to represent some logic gates like XOR.
https://towardsdatascience.com/how-neural-networks-solve-the-xor-problem-59763136bdd7
Basically, this link explains the methodology taken in the book Deep Learning, p. 171 (Ian Goodfellow, Yoshua Bengio, Aaron Courville), where the authors try to learn the XOR function. No matter how you place a linear decision boundary, you will always have a mislabelled point (points can't lie on the decision boundary), as you can see in the first image: "XOR with decision boundary attempt". They then train a 2-layer network with an MSE loss function.
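For reference, that section of the book also exhibits an exact two-layer ReLU solution to XOR; to the best of my recollection the weights below reproduce it, transcribed here into plain Python:

```python
def relu(z):
    return max(0.0, z)

# Hidden layer h = ReLU(W x + c) with W = [[1, 1], [1, 1]], c = (0, -1),
# followed by the linear output f = h1 - 2*h2.
def xor_net(x1, x2):
    h1 = relu(x1 + x2)        # counts how many inputs are active
    h2 = relu(x1 + x2 - 1)    # fires only when BOTH inputs are active
    return h1 - 2 * h2        # subtracts the doubly counted case
```

Plugging in the four corners gives 0, 1, 1, 0: the hidden ReLU layer folds the point (1,1) onto a region where a single line can finish the job.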
However, I think this question is better directed toward the math Stack Exchange. Here is a link if you want to learn more: https://datascience.stackexchange.com/questions/11589/creating-neural-net-for-xor-function/19799#19799

Can Residual Nets skip one linearity instead of two?

Standard in ResNets is to skip 2 linearities.
Would skipping only one work as well?
I would refer you to the original paper by Kaiming He et al.
In sections 3.1-3.2, they define "identity" shortcuts as y = F(x, W) + x, where W are the trainable parameters of the residual mapping F to be learned. It is important that the residual mapping contains a non-linearity; otherwise the whole construction is just one sophisticated linear layer. But the number of linearities is not limited.
For example, ResNeXt network creates identity shortcuts around a stack of only convolutional layers (see the figure below). So there aren't any dense layers in the residual block.
The general answer is thus: yes, it would work. However, in a particular neural network, reducing two dense layers to one may be a bad idea, because the residual block must still be flexible enough to learn the residual function. So remember to validate any design you come up with.
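A minimal sketch of a one-linearity block in plain Python (the 3x3 weight matrix and input vector are made-up numbers; real ResNets use convolutions, batch norm, and a final activation, all omitted here): the shortcut just adds the input straight onto the output of a single weight layer plus non-linearity.

```python
def relu_vec(v):
    return [max(0.0, z) for z in v]

def matvec(W, x):
    return [sum(W[j][i] * x[i] for i in range(len(x))) for j in range(len(W))]

# One-linearity residual block: y = ReLU(W x) + x. The identity shortcut
# requires W x to have the same size as x. W is a made-up 3x3 example.
def residual_block(W, x):
    return [f + s for f, s in zip(relu_vec(matvec(W, x)), x)]

W = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
x = [1.0, -2.0, 3.0]
y = residual_block(W, x)

# With all-zero weights the block collapses to the identity mapping,
# one reason residual blocks are easy to optimise.
zero = [[0.0] * 3 for _ in range(3)]
```

The zero-weight case makes the design choice visible: the block only has to learn a correction on top of the identity, regardless of how many linearities sit inside F.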

Activation function for neural network

I need help figuring out a suitable activation function. I'm training my neural network to detect a piano note, so in this case I can have only one output: either the note is present (1) or the note is not present (0).
Say I introduce a threshold value of 0.5, and say that if the output is greater than 0.5 the desired note is present, and if it's less than 0.5 the note isn't present. What type of activation function can I use? I assume it should be a hard limit, but I'm wondering if sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentiable activation functions. Thresholding is not a good choice for multilayer neural networks. Sigmoid is a fairly generic function which can be applied in most cases. When you are doing binary classification (0/1 values), the most common approach is to define one output neuron and simply choose class 1 iff its output is bigger than a threshold (typically 0.5).
EDIT
As you are working with quite simple data (two input dimensions and two output classes), it seems the best option is to actually set the neural network aside and start with data visualization. 2D data can simply be plotted on the plane (with different colors for different classes). Once you do this, you can investigate how hard it is to separate one class from the other. If the data is laid out so that you can simply put a line between the classes, a linear support vector machine would be a much better choice (as it guarantees one global optimum). If the data seems really complex, and the decision boundary has to be some curve (or even a set of curves), I would suggest going for an RBF SVM, or at least a regularized form of neural network (so its training is at least fairly repeatable). If you decide on a neural network, the situation is quite similar: if the data is simple to separate on the plane, you can use simple (linear/threshold) activation functions. If it is not linearly separable, use sigmoid or hyperbolic tangent, which will ensure non-linearity in the decision boundary.
UPDATE
Many things have changed over the last two years. In particular (as suggested in the comment by @Ulysee), there is growing interest in functions that are differentiable "almost everywhere", such as ReLU. These functions have a valid derivative in most of their domain, so the probability that we will ever need to differentiate at the remaining points is zero. Consequently, we can still use classical methods and, for the sake of completeness, put a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as the softplus function.
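A quick numerical check of that last point (plain Python; the sample points are arbitrary): softplus(x) = ln(1 + e^x) is smooth everywhere, yet nearly indistinguishable from ReLU away from zero.

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    # softplus(x) = ln(1 + e^x); its derivative is exactly the sigmoid.
    return math.log1p(math.exp(x))

# The gap decays like e^(-|x|) away from zero; the largest gap,
# ln(2) ~ 0.693, sits at x = 0, right where ReLU has its kink.
gap_far = abs(softplus(10.0) - relu(10.0))
gap_at_zero = abs(softplus(0.0) - relu(0.0))
```

So softplus trades a small bias near zero for a gradient that is well defined everywhere.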
The Wikipedia article has some useful "soft" continuous threshold functions; see the figure Gjl-t(x).svg at en.wikipedia.org/wiki/Sigmoid_function.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification: one class label is mapped to the output node being activated, and the other class label to the output node not being activated.

Neural networks - why so many learning rules?

I'm starting neural networks, currently following mostly D. Kriesel's tutorial. Right off the beginning it introduces at least three (different?) learning rules (Hebbian, delta rule, backpropagation) concerning supervised learning.
I might be missing something, but if the goal is merely to minimize the error, why not just apply gradient descent over Error(entire_set_of_weights)?
Edit: I must admit the answers still confuse me. It would be helpful if someone could point out the actual difference between those methods, and the difference between them and straight gradient descent.
To stress the point: these learning rules seem to take the layered structure of the network into account, whereas finding the minimum of Error(W) over the entire set of weights completely ignores it. How does that fit together?
One question is how to apportion the "blame" for an error. The classic Delta Rule or LMS rule is essentially gradient descent. When you apply Delta Rule to a multilayer network, you get backprop. Other rules have been created for various reasons, including the desire for faster convergence, non-supervised learning, temporal questions, models that are believed to be closer to biology, etc.
On your specific question of "why not just gradient descent?" Gradient descent may work for some problems, but many problems have local minima, which naive gradient descent will get stuck in. The initial response to that is to add a "momentum" term, so that you might "roll out" of a local minimum; that's pretty much the classic backprop algorithm.
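The momentum idea is a one-line change to plain gradient descent. A minimal sketch on the toy objective f(x) = (x - 3)^2 (the objective, learning rate, and momentum coefficient are arbitrary choices for illustration):

```python
# Toy objective f(x) = (x - 3)^2; its gradient is 2*(x - 3).
def grad(x):
    return 2.0 * (x - 3.0)

x, velocity = 0.0, 0.0
lr, momentum = 0.1, 0.9     # arbitrary illustrative values

for _ in range(200):
    # Plain gradient descent would be: x -= lr * grad(x).
    # Momentum keeps a running velocity that can carry the iterate
    # through shallow bumps (and, one hopes, out of local minima).
    velocity = momentum * velocity - lr * grad(x)
    x = x + velocity
```

On this convex toy problem momentum just accelerates convergence to x = 3; its real value, as noted above, shows up on lumpy error surfaces where plain gradient descent stalls.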
First off, note that "backpropagation" simply means that you apply the delta rule on each layer from output back to input, so it's not a separate rule.
As for why not simple gradient descent: the delta rule basically is gradient descent. However, it tends to overfit the training data and doesn't generalize as well as techniques that don't try to decay the error margin to zero. This makes sense, because "error" here simply means the difference between our samples and the output; the samples are not guaranteed to accurately represent all possible inputs.
Backpropagation and naive gradient descent also differ in computational efficiency. Backprop takes the network's structure into account and, for each weight, calculates only the parts that are actually needed.
The derivative of the error with respect to the weights is split via the chain rule into ∂E/∂W = ∂E/∂A * ∂A/∂W, where A is the activations of particular units. In most cases, many of these derivatives will be zero because W is sparse due to the network's topology. Backprop gives you the learning rules for ignoring those parts of the gradient.
So, from a mathematical perspective, backprop is not that exciting.
There may be problems which, for example, make backprop run into local minima. Furthermore, just as an example, you can't adjust the topology with backprop. There are also cool learning methods using nature-inspired metaheuristics (for instance, evolutionary strategies) that enable adjusting weights AND topology (even recurrent ones) simultaneously. I will probably add one or more chapters to cover them, too.
There is also a discussion feature right on the download page of the manuscript; if you find other hassles that you don't like about the manuscript, feel free to add them to the page so I can change things in the next edition.
Greetz,
David (Kriesel ;-) )
