I am new to machine learning and am learning to program a Perceptron.
What is the derivative of the Heaviside function? For context, I am using the perceptron as a pseudo-SVM so that I can classify data points.
I assume it's 0; however, in tutorials I see people use 1. Why does this work? Also, is it possible to use a sigmoid-type activation function and then pass its output through a Heaviside function, so that I can take the derivative of the sigmoid?
Thanks.
The derivative of the Heaviside step function is zero everywhere except at the jump point (zero), where it does not exist. This is because the Heaviside function is composed of two constant functions on different intervals, and the derivative of a constant function is always zero. The derivative doesn't exist at zero because the function is not continuous there (it has a jump at zero).
In numerical implementations, the derivative is typically set to zero everywhere, including at zero. This is problematic because gradient descent will not work: the weights never get updated and stay at their initial values. To make gradient descent work you need a smoothed version of the Heaviside function. One possibility is simply the sigmoid function.
About the second question: yes, you can define a custom function with a custom gradient in many frameworks, such as TensorFlow. You can set the output to be the Heaviside function's output and the gradient to be the sigmoid function's gradient.
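To make that last idea concrete, here is a minimal hand-written sketch in plain Python (not TensorFlow). It assumes a single neuron with two inputs, a made-up AND dataset, and hypothetical helper names (`heaviside`, `surrogate_grad`, `train`); the forward pass uses the hard threshold while the weight update uses the sigmoid's derivative in place of the zero Heaviside derivative.

```python
import math

def heaviside(z):
    """Hard threshold used in the forward pass: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def surrogate_grad(z):
    """Sigmoid derivative, used instead of the (zero) Heaviside derivative."""
    s = sigmoid(z)
    return s * (1.0 - s)

def train(data, epochs=200, lr=0.5):
    """Train one neuron: Heaviside output forward, sigmoid gradient backward."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            z = w[0] * x1 + w[1] * x2 + b
            out = heaviside(z)                      # forward: hard 0/1 output
            delta = (out - y) * surrogate_grad(z)   # backward: surrogate gradient
            w[0] -= lr * delta * x1
            w[1] -= lr * delta * x2
            b -= lr * delta
    return w, b

# Toy, linearly separable data (logical AND), made up for illustration.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(data)
predictions = [heaviside(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in data]
```

Because `out - y` is zero for correctly classified points, the weights only move on mistakes, which is why this behaves much like the classic perceptron update with a variable step size.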
NOTE: wherever a 0 appears in the formulas (as in h0), it stands for theta (θ), not zero.
I've been studying Andrew Ng's Machine Learning course, and I have the following question:
(Short version: if one looks at all the mathematical expressions/calculations used for both forward AND backward propagation, it appears that we never use the cost function directly, only its derivative. So what is the importance of the cost function and its choice anyway? Is it purely to evaluate our system whenever we feel like it?)
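Andrew mentioned that for logistic regression, using the MSE (mean squared error) cost function
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
wouldn't be good, because applying it to our sigmoid function would yield a non-convex cost function with many local optima, so it is best to use the following logistic cost function:
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
which has two graphs (one for y=0 and one for y=1), both of which are convex.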
My question is the following. Our objective is to minimize the cost function (i.e., have the derivative reach 0), which we achieve with gradient descent, updating our weights using the derivative of the cost function, which in both cases (both cost functions) is the same derivative:
$$dJ = \left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x^{(i)}$$
So how did the different choice of cost function in this case affect our algorithm in any way? In forward propagation, all we need is
$$h_\theta(x^{(i)}) = \mathrm{Sigmoid}(\theta^T x^{(i)})$$
which can be calculated without ever needing to calculate the cost function. Then, in backward propagation and in updating the weights, we always use the derivative of the cost function. So when does the cost function itself come into play? Is it only necessary when we want an indication of how well our network is doing? (Then why not just rely on the derivative to know that?)
Forward propagation does not need the cost function in any way, because you are just applying all your learned weights to the corresponding input.
The cost function is generally used to measure how good your algorithm is by comparing your model's outcome (that is, the result of applying your current weights to your input) with the true label of the input (in supervised algorithms). The main objective is therefore to minimize the cost function, since (in most cases) you want the difference between the prediction and the true label to be as small as possible. In optimization it is very helpful if the function you want to optimize is convex, because that guarantees that any local minimum you find is also the global minimum.
To minimize the cost function, gradient descent is used to iteratively update your weights and get closer to the minimum. The gradient is taken with respect to the learned weights, so that you can update the weights of the model to achieve the lowest possible cost. The backpropagation algorithm is used to adjust the weights using the cost function in the backward pass.
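As a rough sketch of how these pieces fit together, the plain-Python loop below runs gradient descent for logistic regression. The data, the learning rate of 0.1, and the function names are all made up for illustration. Note that `cost` is only called to record progress, while the update itself uses `gradient`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Forward pass: h_theta(x) = sigmoid(theta^T x). No cost function needed."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    """Logistic (cross-entropy) cost, used here only to monitor training."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict(theta, xi)
        total += -(yi * math.log(h) + (1 - yi) * math.log(1 - h))
    return total / len(X)

def gradient(theta, X, y):
    """Gradient of the logistic cost: mean of (h - y) * x over the examples."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / len(X)
    return grad

# Made-up 1-D data; the leading 1.0 in each row is the bias input.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
history = []                      # cost per epoch, e.g. for a cost-vs-epoch plot
for epoch in range(200):
    history.append(cost(theta, X, y))
    g = gradient(theta, X, y)
    theta = [t - 0.1 * gj for t, gj in zip(theta, g)]
```

Plotting `history` against the epoch index would give the usual training curve.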
Technically, you are correct: we do not explicitly use the cost function in any of the calculations for forward propagation and back propagation.
You asked 'what is the importance of the cost function and its choice anyway?'. I have two answers:
The cost function is incredibly important because its gradient is what allows us to update our weights. Although we are only actually computing the gradient of the cost function and not the cost function itself, choosing a different cost function would mean we would have a different gradient, thus changing how we update our weights.
The cost function allows us to evaluate our model performance. It is common practice to plot cost vs epoch to understand how the cost decreases over time.
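To illustrate the first point, the sketch below compares the gradients of two costs for a single sigmoid unit and a single example: the logistic (cross-entropy) cost and half the squared error. The parameter values are arbitrary, and the helper names are hypothetical. The analytic formulas are checked against a central finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def ce_cost(theta, x, y):
    """Logistic (cross-entropy) cost for one example."""
    p = h(theta, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def se_cost(theta, x, y):
    """Half squared error for one example."""
    return 0.5 * (h(theta, x) - y) ** 2

def ce_grad(theta, x, y):
    """Gradient of ce_cost: (h - y) * x."""
    p = h(theta, x)
    return [(p - y) * xj for xj in x]

def se_grad(theta, x, y):
    """Gradient of se_cost: (h - y) * h * (1 - h) * x (extra sigmoid-derivative factor)."""
    p = h(theta, x)
    return [(p - y) * p * (1 - p) * xj for xj in x]

def numeric_grad(cost, theta, x, y, eps=1e-6):
    """Central finite-difference approximation, used only as a sanity check."""
    grad = []
    for j in range(len(theta)):
        up = list(theta); up[j] += eps
        dn = list(theta); dn[j] -= eps
        grad.append((cost(up, x, y) - cost(dn, x, y)) / (2 * eps))
    return grad

theta, x, y = [0.5, -1.0], [1.0, 2.0], 1   # arbitrary illustrative values
g_ce = ce_grad(theta, x, y)
g_se = se_grad(theta, x, y)
```

With these values the squared-error gradient is smaller in magnitude than the cross-entropy gradient by the factor h(1 - h), so the two costs would drive different weight updates.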
Your question indicated you essentially understood all of this already, but I hoped to clarify it a bit. Thanks!
I am learning neural networks for the first time. I was trying to understand how function approximation can be performed using a single hidden layer. I saw this example on Stack Exchange, but I had some questions after going through one of the answers.
Suppose I want to approximate a sine function between 0 and 3.14 radians. Will I have 1 input neuron? If so, assume there are K neurons in the hidden layer, each of which uses a sigmoid transfer function. Then, in the output neuron (if, say, it just uses a linear sum of the results from the hidden layer), how can the output be something other than sigmoid-shaped? Shouldn't the linear sum be a sigmoid as well? In short, how can a sine function be approximated using this architecture in a neural network?
It is possible, and it is formally stated as the universal approximation theorem. It holds for any non-constant, bounded, and monotonically-increasing continuous activation function.
I actually don't know the formal proof, but to get an intuitive idea that it is possible I recommend the following chapter: A visual proof that neural nets can compute any function.
It shows that with enough hidden neurons and the right parameters, you can create step functions as the summed output of the hidden layer. With step functions it is easy to argue how you can approximate any function, at least coarsely. Now, to get the final output correct when the output neuron itself applies a sigmoid $\sigma$ (as in that chapter), the weighted sum of the hidden layer has to be $\sigma^{-1}(f(x))$, since the final neuron then outputs $\sigma(\sigma^{-1}(f(x))) = f(x)$. And as already said, we are able to approximate this at least to some accuracy.
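Here is a small sketch of that step-function idea in plain Python, assuming a linear output neuron as in the question. Nothing is trained: the weights are set by hand, with pairs of very steep sigmoids forming narrow "bumps" whose heights follow sin(x). The function names, the 50 bumps, and the steepness of 1000 are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function (avoids overflow for large |z|)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_network(f, lo, hi, n_steps, steepness=1000.0):
    """Hand-set weights for a 1-input net with one hidden sigmoid layer.

    Each interval gets two hidden neurons: one steps up at the left edge,
    one steps up at the right edge; subtracting them gives a bump.
    """
    hidden = []   # (weight, bias) of each hidden neuron
    out_w = []    # weight from each hidden neuron to the linear output
    width = (hi - lo) / n_steps
    for i in range(n_steps):
        left = lo + i * width
        right = left + width
        height = f(0.5 * (left + right))   # target value at the interval midpoint
        hidden.append((steepness, -steepness * left))
        out_w.append(height)
        hidden.append((steepness, -steepness * right))
        out_w.append(-height)
    return hidden, out_w

def forward(x, hidden, out_w):
    """Linear output neuron: weighted sum of the hidden sigmoid activations."""
    return sum(v * sigmoid(w * x + b) for (w, b), v in zip(hidden, out_w))

hidden, out_w = build_network(math.sin, 0.0, math.pi, n_steps=50)
xs = [math.pi * i / 200 for i in range(201)]
max_err = max(abs(forward(x, hidden, out_w) - math.sin(x)) for x in xs)
```

The sum of sigmoids is not itself sigmoid-shaped because the individual sigmoids are shifted and scaled differently; more bumps give a finer staircase and a smaller `max_err`.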
It seems there is a bit of confusion between activation and transfer function. From Wikipedia ANN:
It seems that the transfer function calculates the net input, while the activation function calculates the output of the neuron. But the MATLAB documentation for an activation function says, and I quote:
satlin(N, FP) is a neural transfer function. Transfer functions calculate a layer's output from its net input.
So who is right? And can you use the terms activation function and transfer function interchangeably?
There is no real conflict: they are the same. I also quote from Wikipedia: "Usually the sums of each node are weighted, and the sum is passed through a non-linear function known as an activation function or transfer function." Don't take the MATLAB documentation too literally; it's thousands of pages long, so some words might not be used in their strict sense.
In machine learning at least, they are used interchangeably by all books I've read.
activation function is used almost exclusively nowadays.
transfer function is mostly used in older (80/90's) books, when machine learning was uncommon, and most readers had an electrical engineering/signal processing background.
So, to sum up
prefer the term activation function. It's more common, and more appropriate, both from a biological point of view (neuron fires when you surpass a threshold) and an engineering point of view (an actual transfer function should describe the whole system)
if anyone else makes a distinction between them, ask them to clear up what they mean
After some research, I found in "Survey of Neural Transfer Functions" by Duch and Jankowski (1999) that:
transfer_function = activation function + output function
And IMO the terminology makes sense now, since we need a value (signal strength) to check whether the neuron will be activated, and then we compute an output from it. And what the whole process does is transfer a signal from one layer to another.
Two functions determine the way signals are processed by neurons. The activation function determines the total signal a neuron receives. The value of the activation function is usually scalar and the arguments are vectors. The second function determining neuron’s signal processing is the output function o(I), operating on scalar activations and returning scalar values. Typically a squashing function is used to keep the output values within specified bounds. These two functions together determine the values of the neuron outgoing signals. The composition of the activation and the output function is called the transfer function o(I(x)).
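In code, the terminology from that quote could be sketched as follows. The weighted sum for I(x) and the logistic squashing function for o(I) are assumed, common choices rather than anything the survey prescribes, and the weights and inputs are made up.

```python
import math

def activation(weights, bias, x):
    """I(x): total signal the neuron receives (vector in, scalar out).
    A weighted sum plus bias is assumed here."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def output(I):
    """o(I): squashing function on the scalar activation (logistic assumed)."""
    return 1.0 / (1.0 + math.exp(-I))

def transfer(weights, bias, x):
    """o(I(x)): composition of the activation and output functions."""
    return output(activation(weights, bias, x))

weights, bias, x = [0.5, -0.25], 0.1, [2.0, 4.0]   # illustrative values
I = activation(weights, bias, x)   # 0.5*2.0 - 0.25*4.0 + 0.1
y = transfer(weights, bias, x)
```

In the more common machine-learning usage, `activation` here would be called the net input and `output` would be called the activation function, which is exactly the naming clash discussed in this thread.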
I think the diagram is correct but not terminologically accurate.
The transfer function includes both the activation and transfer functions in your diagram. What is called transfer function in your diagram is usually referred to as the net input function. The net input function only adds weights to the inputs and calculates the net input, which is usually equal to the sum of the inputs multiplied by given weights. The activation function, which can be a sigmoid, step, etc. function, is applied to the net input to generate the output.
Transfer functions take their name from "transformation" and are used for transformation purposes. The activation function, on the other hand, checks whether the output meets a certain threshold and outputs either zero or one. Some examples of non-linear transfer functions are softmax and sigmoid.
For example, suppose we have continuous input signal x(t). This input signal is transformed into an output signal y(t) through a transfer function H(s).
Y(s) = H(s)X(s)
Transfer function H(s) as can be seen above changes the state of the input X(s) into a new output state Y(s) through transformation.
A closer look at H(s) shows that it can represent a weight in a neural network. Therefore, H(s)X(s) is simply the multiplication of the input signal and its weight. Several of these input-weight pairs in a given layer are then summed up to form the input of another layer. This means that the input to any layer of a neural network is simply the transfer function of its input and the weight, i.e., a linear transformation, because the input is now transformed by the weights. But in the real world, problems are non-linear in nature. Therefore, to make the incoming data non-linear, we use a non-linear mapping called an activation function. An activation function is a decision-making function that determines the presence of a particular neural feature. It is mapped between 0 and 1, where zero means the feature is not there, while one means the feature is present. Unfortunately, the small changes occurring in the weights cannot be reflected in the activation value because it can only take either 0 or 1. Therefore, nonlinear functions must be continuous and differentiable between this range.
In a real sense, before outputting an activation you calculate the sigmoid first, since it is continuous and differentiable, and then use it as an input to an activation function that checks whether the output of the sigmoid is higher than its activation threshold. A neural network must be able to take any input from -infinity to +infinity, but it should be able to map it to an output between 0 and 1, or between -1 and 1 in some cases; thus the need for an activation function.
I am also a newbie in machine learning field. From what I understand...
Transfer function:
The transfer function calculates the net weighted input, so any modification to your code or calculation needs to be done before the transfer function. You can use various transfer functions, as suits your task.
Activation function: This is used for applying a threshold value, i.e. deciding when your network will give an output. If your calculated result is greater than the threshold value it will show output, otherwise not.
Hope this helps.
I am interested in trying NN in a perhaps unusual setting.
The input to the NN is a vector. The output is also a vector. However, the training data and error is not computed directly on this output vector, but is a (nonlinear) function of this output vector. So at each epoch, I need to activate the NN, find an output vector, apply this to my (external) nonlinear function to compute a new output vector. However, this new output vector is of length 1 and the error is computed based on just this single output.
Some questions:
Is this something that NN might usefully do?
Is this a structure that is well-known already?
Any ideas how to approach this?
In principle, yes.
Yes, this is what a softmax unit does. It takes the activations at the output layer and computes a single value from them, which is then used to compute the error.
You need to know the partial derivatives of your multivariate function (let's call it f) with respect to each of its inputs, i.e. the network's outputs. From there, you can use the chain rule to compute the derivative of the error with respect to the inputs of f and backpropagate that error derivative through the network.
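A minimal sketch of that chain-rule step in plain Python follows. Everything here is a stand-in: a single linear layer plays the role of the network, `f` is a hypothetical external nonlinear function with known partial derivatives, and the error is half the squared difference to a scalar target. A finite-difference check is included to sanity-check the analytic gradient.

```python
def network(W, x):
    """Stand-in for the NN: a single linear layer y = W x."""
    return [sum(wkj * xj for wkj, xj in zip(row, x)) for row in W]

def f(y):
    """Hypothetical external nonlinear function: vector in, scalar out."""
    return y[0] * y[1] + y[0] ** 2

def f_grad(y):
    """Partial derivatives of f with respect to each network output."""
    return [y[1] + 2.0 * y[0], y[0]]

def error(W, x, target):
    return 0.5 * (f(network(W, x)) - target) ** 2

def error_grad(W, x, target):
    """Chain rule: dE/dW[k][j] = (f(y) - target) * df/dy[k] * x[j]."""
    y = network(W, x)
    dE_df = f(y) - target
    dE_dy = [dE_df * g for g in f_grad(y)]        # error signal at the outputs
    return [[dk * xj for xj in x] for dk in dE_dy]

def numeric_grad(W, x, target, eps=1e-6):
    """Central finite differences, used only to check error_grad."""
    grad = []
    for k in range(len(W)):
        row = []
        for j in range(len(W[k])):
            up = [r[:] for r in W]; up[k][j] += eps
            dn = [r[:] for r in W]; dn[k][j] -= eps
            row.append((error(up, x, target) - error(dn, x, target)) / (2 * eps))
        grad.append(row)
    return grad

W = [[0.2, -0.4], [0.7, 0.1]]   # illustrative weights
x, target = [1.0, 2.0], 1.0
analytic = error_grad(W, x, target)
numeric = numeric_grad(W, x, target)
```

In a deeper network, `dE_dy` is the quantity that would be handed to ordinary backpropagation in place of the usual output-layer error.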
For a regression with some basis functions, I read that Gaussian basis functions are local whereas polynomial basis functions are global. What does this mean?
Thank you
A gaussian is centered around a certain value and tapers off to 0 as you get far away from it. In contrast, a polynomial extends over the whole range.
This means that a gaussian will model a local feature of the data (like a bump or valley), whereas a polynomial will model global patterns in the data (say, an overall downward or upward trend).
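A quick numerical illustration, with an arbitrarily chosen center, width, and degree: evaluating one Gaussian basis function and one polynomial basis function at points increasingly far from the Gaussian's center shows the first dying out while the second keeps growing.

```python
import math

def gaussian_basis(x, center, width=1.0):
    """Gaussian bump: equals 1 at the center and decays toward 0 away from it."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def polynomial_basis(x, degree):
    """Monomial x**degree: influences the fit over the whole input range."""
    return x ** degree

xs = [0.0, 1.0, 2.0, 5.0, 10.0]
gauss_vals = [gaussian_basis(x, center=1.0) for x in xs]
poly_vals = [polynomial_basis(x, degree=2) for x in xs]
```

So changing the coefficient of the Gaussian mostly affects predictions near x = 1, whereas changing the coefficient of the monomial changes predictions everywhere, and most strongly far from the origin.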
A local basis function (you will also often see it referred to as a compactly supported basis function) is essentially non-zero only on a particular interval. Examples of such functions used in approximation/regression are B-splines, wavelets, etc. Polynomials, on the other hand, are non-zero everywhere apart from at their roots. Consider a least squares regression curve using a monomial basis: your resulting Vandermonde matrix will not exhibit any kind of sparsity structure, since an element can only be zero if x=0. Now assume that you try the same problem with a B-spline curve with fixed knots. Because the basis functions are local, your matrix will be banded: each row will contain many zero entries, because the effect of each basis function is only present on a certain interval.
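The difference in matrix structure can be seen in a small plain-Python sketch. It assumes linear B-splines (hat functions) on uniform knots as the local basis, which is the simplest case, and sample points that are made up and chosen to fall strictly between knots.

```python
def hat(x, knots, i):
    """Linear B-spline (hat function) peaking at knots[i];
    zero outside the interval [knots[i-1], knots[i+1]]."""
    c = knots[i]
    if i > 0 and knots[i - 1] <= x <= c:
        return (x - knots[i - 1]) / (c - knots[i - 1])
    if i < len(knots) - 1 and c <= x <= knots[i + 1]:
        return (knots[i + 1] - x) / (knots[i + 1] - c)
    return 0.0

knots = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
xs = [0.25 + 0.5 * k for k in range(10)]   # 0.25, 0.75, ..., 4.75

# Design matrix for the local basis: one row per sample, one column per hat.
B = [[hat(x, knots, i) for i in range(len(knots))] for x in xs]

# Vandermonde (monomial) design matrix with the same number of columns.
V = [[x ** j for j in range(len(knots))] for x in xs]

nonzeros_per_row_B = [sum(1 for v in row if v != 0.0) for row in B]
nonzeros_per_row_V = [sum(1 for v in row if v != 0.0) for row in V]
```

Each row of `B` has only two non-zero entries, sitting in adjacent columns, which is the banded pattern; every entry of `V` is non-zero. Higher-degree B-splines widen the band but keep it.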