It seems there is a bit of confusion between activation function and transfer function. From the Wikipedia article on artificial neural networks, it seems that the transfer function calculates the net input while the activation function calculates the output of the neuron. But in the Matlab documentation of an activation function, I quote:
satlin(N, FP) is a neural transfer function. Transfer functions calculate a layer's output from its net input.
So who is right? And can the terms activation function and transfer function be used interchangeably?
They are the same thing. I also quote from Wikipedia: "Usually the sums of each node are weighted, and the sum is passed through a non-linear function known as an activation function or transfer function." Don't take the Matlab documentation too literally; it's thousands of pages long, so some words might not be used in their strict sense.
In machine learning at least, they are used interchangeably by all books I've read.
The term activation function is used almost exclusively nowadays.
The term transfer function is mostly found in older (80s/90s) books, when machine learning was uncommon and most readers had an electrical engineering/signal processing background.
So, to sum up:
Prefer the term activation function. It's more common and more appropriate, both from a biological point of view (a neuron fires when a threshold is surpassed) and an engineering point of view (an actual transfer function should describe the whole system).
If anyone makes a distinction between them, ask them to clarify what they mean.
After some research, I found in "Survey of Neural Transfer Functions" by Duch and Jankowski (1999) that:
transfer function = activation function + output function
And IMO the terminology makes sense now, since we need a value (signal strength) to check whether the neuron will be activated, and then compute an output from it. What the whole process does is transfer a signal from one layer to another.
Two functions determine the way signals are processed by neurons. The activation function determines the total signal a neuron receives. The value of the activation function is usually scalar and the arguments are vectors. The second function determining a neuron's signal processing is the output function o(I), operating on scalar activations and returning scalar values. Typically a squashing function is used to keep the output values within specified bounds. These two functions together determine the values of the neuron's outgoing signals. The composition of the activation and the output function is called the transfer function o(I(x)).
I think the diagram is correct but not terminologically accurate.
The transfer function covers both the "activation function" and "transfer function" boxes in your diagram. What your diagram calls the transfer function is usually referred to as the net input function. The net input function only weights the inputs and computes the net input, which is usually equal to the sum of the inputs multiplied by their weights. The activation function, which can be a sigmoid, step, or similar function, is then applied to the net input to generate the output.
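For concreteness, here is a minimal Python/NumPy sketch of that decomposition. The function names (net_input, activation, transfer) and the example values are hypothetical; they mirror the net-input-plus-activation terminology used here, and in the Duch and Jankowski naming net_input corresponds to the activation function I(x), activation to the output function o(I), and transfer to their composition o(I(x)).

```python
import numpy as np

def net_input(x, w, b):
    # Net input function: weighted sum of the inputs plus a bias.
    return np.dot(w, x) + b

def activation(z):
    # Activation function: a sigmoid squashing the net input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def transfer(x, w, b):
    # Composition of the two; in Duch & Jankowski's terms, o(I(x)).
    return activation(net_input(x, w, b))

x = np.array([0.5, -1.2, 3.0])   # hypothetical inputs
w = np.array([0.4, 0.1, -0.6])   # hypothetical weights
b = 0.2                          # hypothetical bias
print(transfer(x, w, b))
```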
The term transfer function comes from transformation: it is used to transform a signal. An activation function, on the other hand, checks whether the output meets a certain threshold and outputs either zero or one. Some examples of non-linear transfer functions are softmax and sigmoid.
For example, suppose we have a continuous input signal x(t). This input signal is transformed into an output signal y(t) through a transfer function H(s).
Y(s) = H(s)X(s)
The transfer function H(s), as seen above, changes the input state X(s) into a new output state Y(s) through a transformation.
A closer look at H(s) shows that it can represent a weight in a neural network. H(s)X(s) is then simply the multiplication of the input signal by its weight. Several of these input-weight products in a given layer are summed up to form the input of the next layer. This means that the input to any layer of a neural network is just a linear transformation of its inputs by the weights. But real-world problems are non-linear in nature, so to make the incoming data non-linear we use a non-linear mapping called an activation function. An activation function is a decision-making function that determines the presence of a particular neural feature. It is mapped between 0 and 1, where zero means the feature is absent and one means it is present. Unfortunately, small changes in the weights cannot be reflected in an activation value that can only take the values 0 or 1. Therefore, the non-linear functions used must be continuous and differentiable over this range.
In practice, before outputting an activation you compute the sigmoid first, since it is continuous and differentiable, and then check whether the sigmoid output is higher than the activation threshold. A neural network must be able to take any input from minus infinity to plus infinity, but it should map it to an output that ranges between {0,1}, or between {-1,1} in some cases; hence the need for an activation function.
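To illustrate the point about small weight changes, here is a hedged sketch with made-up numbers, comparing a hard 0/1 step with a sigmoid: nudging one weight slightly leaves the step output unchanged, while the sigmoid output moves, which is what gradient-based training relies on.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])
w = np.array([0.30, -0.10])           # original weights (hypothetical)
w_nudged = w + np.array([0.01, 0.0])  # a small change to one weight

for f in (step, sigmoid):
    before = f(np.dot(w, x))
    after = f(np.dot(w_nudged, x))
    print(f.__name__, before, "->", after)

# The step output stays at the same 0/1 value, while the sigmoid
# output shifts slightly, so the weight change is visible to training.
```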
I am also a newbie in the machine learning field. From what I understand...
Transfer function:
The transfer function calculates the net input, so any modification to your inputs or weights needs to be done before the transfer function is applied. You can use various transfer functions, whichever suits your task.
Activation function: This is used to apply a threshold, i.e. to decide when your network will give an output. If the calculated result is greater than the threshold value, the neuron produces an output; otherwise it does not.
Hope this helps.
Related
I am trying to simulate an XOR gate using a neural network similar to this:
Now I understand that each neuron has a certain number of weights and a bias. I am using a sigmoid function to determine whether a neuron should fire or not in each state (since this uses a sigmoid rather than a step function, I use "firing" in a loose sense, as it actually outputs real values).
I successfully ran the simulation for the feed-forward part, and now I want to use the backpropagation algorithm to update the weights and train the model. The question is: for each combination of x1 and x2 there is a separate result (4 combinations in total), and for each input pair a separate error distance (the difference between the desired output and the actual result) can be computed, which eventually leads to a different set of weight updates. This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
How should we decide on the right weight updates?
Say we repeat backpropagation for a single input pair until we converge; but what if we would converge to a different set of weights had we chosen another pair of inputs?
Now I understand that each neuron has certain weights. I am using a sigmoid function to determine whether a neuron should fire or not in each state.
You do not really "decide" this; typical MLPs do not "fire", they output real values. There are neural networks which actually fire (like RBMs), but that is a completely different model.
This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
This is actually a feature. Let's start from the beginning. You try to minimize some loss function over your whole training set (in your case, 4 samples), which is of the form:
L(theta) = SUM_i l(f(x_i), y_i)
where l is some loss function, f(x_i) is your current prediction and y_i is the true value. You do this by gradient descent, so you compute the gradient of L and step against it:
grad L(theta) = grad SUM_i l(f(x_i), y_i) = SUM_i grad l(f(x_i), y_i)
What you call "a single update" is grad l(f(x_i), y_i) for a single training pair (x_i, y_i). Usually you would not use this on its own; instead you would sum (or take the average of) the updates across the whole dataset, as that is your true gradient. However, in practice this might not be computationally feasible (the training set is usually quite large), and furthermore it has been shown empirically that more "noise" in training is usually better. Thus another learning technique emerged, called stochastic gradient descent, which, in short, shows that under some light assumptions (such as an additive loss function) you can actually apply your "small updates" independently and still converge to a local minimum. In other words, you can do your updates "point-wise" in random order and you will still learn. Will it always be the same solution? No. But this is also true for computing the whole gradient: optimization of non-convex functions is nearly always non-deterministic (you find some local solution, not the global one).
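To make the batch-vs-stochastic distinction concrete, here is a minimal sketch. It is not the asker's XOR network; it uses the smallest possible model (one parameter, squared loss) so that grad L = SUM_i grad l_i is easy to see, and compares a full-gradient update with point-wise SGD updates applied in random order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear model f(x) = theta * x with squared loss
# l(f(x_i), y_i) = (theta * x_i - y_i)**2.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + rng.normal(0, 0.1, size=X.shape)

def grad_i(theta, x_i, y_i):
    # Gradient of the loss on a single training pair (x_i, y_i).
    return 2.0 * (theta * x_i - y_i) * x_i

lr = 0.01

# Full (batch) gradient descent: grad L = SUM_i grad l_i.
theta = 0.0
for _ in range(200):
    theta -= lr * sum(grad_i(theta, x_i, y_i) for x_i, y_i in zip(X, y))
print("batch GD:", theta)

# Stochastic gradient descent: apply the per-sample updates one at a
# time, in random order.  Both end up near the same theta (about 2.0).
theta = 0.0
for _ in range(200):
    for i in rng.permutation(len(X)):
        theta -= lr * grad_i(theta, X[i], y[i])
print("SGD:", theta)
```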
I am learning neural networks for the first time. I was trying to understand how function approximation can be performed using a single hidden layer. I saw this example on StackExchange, but I had some questions after going through one of the answers.
Suppose I want to approximate a sine function between 0 and 3.14 radians. Will I have 1 input neuron? If so, suppose I then have K neurons in the hidden layer, each of which uses a sigmoid transfer function. In the output neuron (say it just uses a linear sum of the results from the hidden layer), how can the output be something other than a sigmoid shape? Shouldn't the linear sum be sigmoid-shaped as well? In short, how can a sine function be approximated using this architecture in a neural network?
It is possible, and it is formally stated as the universal approximation theorem. It holds for any non-constant, bounded, and monotonically increasing continuous activation function.
I don't actually know the formal proof, but to get an intuitive idea that it is possible I recommend the following chapter: A visual proof that neural nets can compute any function.
It shows that with enough hidden neurons and the right parameters you can create step functions as the summed output of the hidden layer. With step functions it is easy to argue how you can approximate any function at least coarsely. Now, to get the final output correct, the summed output of the hidden layer has to approximate the target function itself, since that sum is what the final (linear) output neuron emits. And, as already said, we are able to approximate this at least to some accuracy.
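Here is a rough sketch of that idea in Python/NumPy, assuming the architecture from the question (sigmoid hidden layer, linear output). The parameters are hand-picked rather than trained: each pair of steep sigmoid hidden units forms an approximate "bump", and the linear output sums the bumps with heights sin(c_i), giving a coarse approximation of the sine on [0, π].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked (not trained) hidden layer: each "bump" over [c - h/2, c + h/2]
# is the difference of two steep sigmoids, i.e. two hidden neurons.
# The output neuron is a plain linear sum weighted by sin(c).
centers = np.linspace(0.0, np.pi, 30)
h = centers[1] - centers[0]
k = 200.0  # steepness; larger k -> closer to true step functions

def approx_sin(x):
    x = np.atleast_1d(x)[:, None]
    bumps = sigmoid(k * (x - (centers - h / 2))) - sigmoid(k * (x - (centers + h / 2)))
    return bumps @ np.sin(centers)

xs = np.linspace(0.0, np.pi, 500)
print("max abs error:", np.max(np.abs(approx_sin(xs) - np.sin(xs))))
```

More hidden neurons (a finer grid of bumps) reduce the error; a trained network would find a smoother set of parameters, but the principle is the same.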
Can I compute the function f(x) = sqr(x) using an OpenCV ANN?
I need to train my ANN using a set of integers and their squared values.
I need to get the squared value of an integer as the output from the ANN model.
If we can do that using an OpenCV ANN, what would be the number of input neurons and output neurons, and how do I specify the classes, etc.?
You mention class specification, but I don't think that this is a class categorization problem. I think it would be better to treat the input as X, and the output as sqr(X). Then this becomes a general function approximation problem.
There is an issue with this particular problem however. Neural networks aren't well suited for functions with unbounded input/output. The output of a neural network is usually limited to the range of its activation function, and the input value is usually scaled to some reasonable range. Assuming you are using the default activation (symmetrical sigmoid), your output is limited to (-1, 1). If you have a limited range of integers you want to use, you can still do this, but you'll have to scale the inputs and outputs accordingly.
If you use this method, there will be one input node, and one output node, corresponding to the scaled versions of X and sqr(X) respectively. OpenCV will try to take care of scaling for you automatically. It's probably best for you to trust this, UNLESS you are planning on providing multiple different sets of training data. The different sets may have different distributions, hence a different scale.
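A rough sketch of how this could look with OpenCV's Python bindings (cv2.ml.ANN_MLP). The layer sizes, scaling range and training parameters below are illustrative guesses, not tuned values; as noted above, OpenCV can also handle scaling itself, and the manual scaling here is just to keep the idea explicit.

```python
import numpy as np
import cv2

# Train on integers 0..20, scaled into the activation's output range.
xs = np.arange(0, 21, dtype=np.float32).reshape(-1, 1)
ys = xs ** 2

x_scaled = (xs / 20.0).astype(np.float32)    # inputs scaled to [0, 1]
y_scaled = (ys / 400.0).astype(np.float32)   # outputs scaled to [0, 1]

ann = cv2.ml.ANN_MLP_create()
ann.setLayerSizes(np.array([1, 8, 1], dtype=np.int32))   # 1 input node, 1 output node
ann.setActivationFunction(cv2.ml.ANN_MLP_SIGMOID_SYM, 1.0, 1.0)
ann.setTrainMethod(cv2.ml.ANN_MLP_BACKPROP, 0.1, 0.1)
ann.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 5000, 1e-6))

ann.train(x_scaled, cv2.ml.ROW_SAMPLE, y_scaled)

_, pred = ann.predict(x_scaled)
print(np.round(pred * 400.0).ravel())        # undo the output scaling
```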
I am somewhat confused by the use of the term linear/non-linear when discussing neural networks. Can anyone clarify these 3 points for me:
Each node in a neural net computes a weighted sum of its inputs. This is a linear combination of the inputs, so the value at each node (ignoring activation) is given by some linear function. I hear that neural nets are universal function approximators. Does this mean that, despite containing linear functions within each node, the total network is able to approximate a non-linear function as well? Are there any clear examples of how this works in practice?
An activation function is applied to the output of that node to squash/transform the output for further propagation through the rest of the network. Am I correct in interpreting this output from the activation function as the "strength" of that node?
Activation functions are also referred to as non-linear functions. Where does the term non-linear come from? After all, the input into the activation is the result of a linear combination of the inputs to the node. I assume it refers to the idea that something like the sigmoid function is itself a non-linear function? Why does it matter that the activation is non-linear?
1 Linearity
A neural network is only non-linear if you squash the output signal from the nodes with a non-linear activation function. A complete neural network (with non-linear activation functions) is an arbitrary function approximator.
Bonus: It should be noted that if you use linear activation functions in multiple consecutive layers, you could just as well prune them down to a single layer, because composing linear maps is still linear (the weights would simply combine into different, possibly more extreme, values); see the sketch below. A network with multiple layers of linear activation functions cannot model more complicated functions than a network with a single layer.
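A quick NumPy sketch of this collapse (the weights are random placeholders): two consecutive linear layers are exactly equivalent to one layer with the combined weight matrix, and inserting a non-linearity breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

W1 = rng.normal(size=(4, 3))   # first "layer" weights
W2 = rng.normal(size=(2, 4))   # second "layer" weights

# Two consecutive linear layers...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with the combined weight matrix W2 @ W1.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))           # True

# A non-linearity in between breaks this equivalence.
relu = lambda z: np.maximum(z, 0.0)
print(np.allclose(W2 @ relu(W1 @ x), one_layer))    # (almost surely) False
```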
2 Activation signal
The squashed output signal could very well be interpreted as the strength of this signal (biologically speaking). Though it might be incorrect to interpret the output strength as an equivalent of confidence, as in fuzzy logic.
3 Non-linear activation functions
Yes, you are spot on. The input signals combined with their respective weights form a linear combination. The non-linearity comes from your choice of activation functions. Remember that a linear function is drawn as a line; sigmoid, tanh, ReLU and so on cannot be drawn as a single straight line.
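If you want to see the "non-linear" label concretely: a linear function must satisfy f(a + b) = f(a) + f(b), and the common activation functions do not. A tiny check with arbitrary test values:

```python
import numpy as np

def is_additive(f, a, b, tol=1e-9):
    # A linear function must satisfy f(a + b) == f(a) + f(b).
    return abs(f(a + b) - (f(a) + f(b))) < tol

linear = lambda z: 3.0 * z
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: max(z, 0.0)

a, b = 1.0, -2.0
print(is_additive(linear, a, b))    # True
print(is_additive(sigmoid, a, b))   # False
print(is_additive(relu, a, b))      # False
```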
Why do we need non-linear activation functions?
Most functions and classification tasks are probably best described by non-linear functions. If we decided to use linear activation functions, we would end up with a much coarser approximation of a complex function.
Universal approximators
You can sometimes read in papers that neural networks are universal approximators. This implies that a "perfect" network could be fitted to any model/function you could throw at it, though configuring the perfect network (#nodes and #layers ++) is a non-trivial task.
Read more about the implications at this Wikipedia page.
I need help figuring out a suitable activation function. I'm training my neural network to detect a piano note, so in this case I can have only one output: either the note is present (1) or it is not (0).
Say I introduce a threshold value of 0.5, so that if the output is greater than 0.5 the desired note is present and if it's less than 0.5 the note isn't present. What type of activation function can I use? I assume it should be a hard limit, but I'm wondering whether a sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentiable activation functions. Thresholding is not a good choice for multilayer neural networks. The sigmoid is quite a generic function, which can be applied in most cases. When you are doing binary classification (0/1 values), the most common approach is to define one output neuron and simply choose class 1 iff its output is bigger than a threshold (typically 0.5).
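As a small illustration of that last point, here is a sketch of a single sigmoid output neuron with a 0.5 decision threshold. The weights and features are made up; the point is that the network outputs a continuous value in (0, 1) during training, and the hard threshold is applied only when you read off the decision.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained parameters for a single output neuron.
w = np.array([0.8, -0.3, 1.1])
b = -0.2

def note_present(features, threshold=0.5):
    prob = sigmoid(np.dot(w, features) + b)   # continuous output in (0, 1)
    return prob > threshold                   # thresholded only at decision time

print(note_present(np.array([0.9, 0.1, 0.7])))
```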
EDIT
As you are working with quite simple data (two input dimensions and two output classes), it seems the best option is to actually abandon neural networks and start with data visualization. 2D data can simply be plotted on the plane (with different colors for different classes). Once you do this, you can investigate how hard it is to separate one class from the other. If the data is laid out in such a way that you can simply put a line separating the classes, a linear support vector machine would be a much better choice (as it guarantees one global optimum). If the data seems really complex, and the decision boundary has to be some curve (or even a set of curves), I would suggest going for an RBF SVM, or at least a regularized form of neural network (so that its training is at least fairly repeatable). If you decide on a neural network, the situation is quite similar: if the data is simple to separate on the plane, you can use simple (linear/threshold) activation functions; if it is not linearly separable, use the sigmoid or hyperbolic tangent, which will ensure non-linearity in the decision boundary.
UPDATE
Many things have changed over the last two years. In particular (as suggested in the comment, #Ulysee), there is a growing interest in functions that are differentiable "almost everywhere", such as ReLU. These functions have a valid derivative over most of their domain, so the probability that we will ever need to take the derivative exactly at the problematic point is zero. Consequently, we can still use classical methods and, for the sake of completeness, put a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as the softplus function.
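For reference, a small sketch of ReLU with the "zero derivative at 0" convention mentioned above, next to its smooth softplus approximation:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # ReLU is not differentiable at exactly 0; by convention we just
    # return 0 there (any value in [0, 1] would do in practice).
    return (z > 0).astype(float)

def softplus(z):
    # Fully differentiable, smooth approximation of ReLU.
    return np.log1p(np.exp(z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_grad(z), softplus(z))
```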
The Wikipedia article on the sigmoid function has some useful "soft" continuous threshold functions; see the figure Gjl-t(x).svg at en.wikipedia.org/wiki/Sigmoid_function.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification, where one class label is mapped to the output node when activated, and the other class label for when the output node is not activated.