I decided to build a genetic algorithm to train neural networks. The networks will develop through inheritance, and one of the (many) variable genes should be the transfer function.
So I need to go deeper into the mathematics, and it is really time-consuming.
I have, for example, three variants of the transfer function gene:
1) log sigmoid function
2) tan sigmoid function
3) Gaussian function
One of the features of the transfer function gene should be that it can modify the parameters of the function to give the function a different shape.
And now, the problem that I have not been able to solve yet:
I have an error at the output of the neural network. How do I transfer it back onto the weights through different functions with different parameters? According to my research, I think it has something to do with derivatives and gradient descent.
I am a noob at higher-level math. Can someone explain, with a simple example, how to propagate the error back onto the weights through a parametrized function (for example, a sigmoid)?
EDIT
I am still doing research, and now I am not sure whether I have misunderstood backpropagation. I found this document:
http://www.google.cz/url?sa=t&rct=j&q=backpropagation+algorithm+sigmoid+examples&source=web&cd=10&ved=0CHwQFjAJ&url=http%3A%2F%2Fwww4.rgu.ac.uk%2Ffiles%2Fchapter3%2520-%2520bp.pdf&ei=ZF9CT-7PIsak4gTRypiiCA&usg=AFQjCNGWZjabH5ALbDLgSOBak-BTRGmS3g
and it has an example of computing weights in which the transfer function is NOT involved in the weight adjustment.
So is it not necessary to involve the transfer function in adjusting the weights?
Backpropagation does indeed have something to do with derivatives and gradient descent.
I don't think there is any shortcut to truly understanding the math, but this may help. I wrote it for someone else with basically the same question, and it should at least explain at a high level what's going on, and why.
How does a back-propagation training algorithm work?
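To make the role of the derivative concrete, here is a minimal sketch of one neuron with a parametrized log-sigmoid trained by gradient descent. The steepness parameter `k`, the squared-error loss, the learning rate, and all function names are assumptions chosen for illustration, not something taken from the question.

```python
import math

def sigmoid(z, k=1.0):
    """Log-sigmoid with an assumed steepness parameter k."""
    return 1.0 / (1.0 + math.exp(-k * z))

def sigmoid_grad(z, k=1.0):
    """Derivative of sigmoid(z, k) with respect to z: k * s * (1 - s)."""
    s = sigmoid(z, k)
    return k * s * (1.0 - s)

def train_step(w, b, x, target, k=1.0, lr=0.5):
    """One gradient-descent step for a single neuron with squared error."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum plus bias
    y = sigmoid(z, k)                              # neuron output
    error = y - target                             # dE/dy for E = 0.5 * (y - target)^2
    delta = error * sigmoid_grad(z, k)             # chain rule: dE/dz
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]  # dE/dw_i = delta * x_i
    new_b = b - lr * delta                         # dE/db = delta
    return new_w, new_b, 0.5 * error ** 2

w, b = [0.1, -0.2], 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=[1.0, 0.5], target=0.9, k=2.0)
    losses.append(loss)

print(f"first loss: {losses[0]:.6f}, last loss: {losses[-1]:.6f}")
```

Swapping in a different transfer function only means replacing `sigmoid` and `sigmoid_grad` as a pair; the rest of the update stays the same. The transfer function's derivative is what scales the output error before it reaches the weights.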
Honestly, I'm learning about neural networks, but I have a question about the activation part.
I know that the question is general and that there are a lot of explanations around the internet, but I still don't understand it clearly.
Why do we need to differentiate the sigmoid function? Why don't we just use it?
It would be good if you could give a clear explanation. Thank you.
I've seen many videos on YouTube and read many articles about it, but I still don't get it.
Thanks for your help.
Your question is not entirely clear, but I assume you are asking: "Why don't we just use the Sigmoid function without having to calculate its derivative?".
Your question is also very broad, so my answer is broad and wordy too. You will need to read more to understand all the details, and I'll try to provide links for that.
Activation function: as the name suggests, we want to know whether a given node is "on" or "off", and the sigmoid function provides an easy way to turn a continuous variable (X) into a value in the range (0, 1).
Use cases vary, and each function has its own properties, which is why there are many alternative "activation" functions, like tanh, ReLU, etc. Read more here: https://en.wikipedia.org/wiki/Sigmoid_function
Differentiate (derive): in most models we want to find the best-fit beta parameters (the weights) that feed our activation functions. To do this, we typically minimise a "cost" function that describes how good our model is at predicting observed data. One way to solve this optimisation problem is gradient descent. Each step of gradient descent updates the parameters by moving downhill on the multi-dimensional cost-function surface. To do this, it needs the gradient of the cost with respect to each parameter, and by the chain rule that gradient contains the derivative of the activation function. This is why backpropagation, which uses gradient descent to optimise the network, requires (in most cases) that the activation functions you use be differentiable.
Read more here: https://en.wikipedia.org/wiki/Gradient_descent
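As a small illustration of what "the derivative of the sigmoid" is, the sketch below compares the closed form s(x) * (1 - s(x)) against a finite-difference estimate. The helper names and the step size `h` are assumptions made for this example.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Closed-form derivative: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 3.0):
    analytic = sigmoid_derivative(x)
    numeric = numerical_derivative(sigmoid, x)
    print(f"x={x:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The sigmoid itself only tells you the node's output. Its derivative tells you how much that output changes when the input changes slightly, and that is the quantity gradient descent needs in order to decide how to adjust each weight.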
I suggest if you have a deeper question that you take it to one of the machine learning stackexchange sites.
Many of the papers I have read so far mention that "pre-training network could improve computational efficiency in terms of back-propagating errors", and that this could be achieved using RBMs or autoencoders.
If I have understood correctly, autoencoders work by learning the identity function, and if an autoencoder has fewer hidden units than the size of the input data, it also does compression. BUT what does this have to do with improving computational efficiency in propagating the error signal backwards? Is it because the weights of the pre-trained hidden units do not diverge much from their initial values?
Data scientists reading this will already know that autoencoders take their inputs as target values, since they are learning the identity function, which is regarded as unsupervised learning. But can such a method be applied to convolutional neural networks, for which the first hidden layer is a feature map? Each feature map is created by convolving a learned kernel with a receptive field in the image. How could this learned kernel be obtained by pre-training (in an unsupervised fashion)?
One thing to note is that autoencoders try to learn a non-trivial approximation of the identity function, not the identity function itself. Otherwise they wouldn't be useful at all. Pre-training helps move the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used to improve upon those weights. Note that gradient descent can get stuck in the closest local minimum.
[Ignore the term Global Minima in the image posted and think of it as another, better, local minimum]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to go. But you may put yourself on a route that has a lot of obstacles, uphills and downhills. Then suppose someone tells you about a route and a direction they have taken before (the pre-training) and hands you a new map (the pre-training phase's starting point).
This could be an intuitive reason why starting with random weights and immediately optimizing the model with backpropagation may not necessarily give you the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not necessarily use pre-training, and they may use backpropagation in combination with other optimization methods (e.g. AdaGrad, RMSProp, momentum, ...) to hopefully avoid getting stuck in a bad local minimum.
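The effect of the starting point can be sketched on a toy one-dimensional error surface. The function `f`, the two starting values, and the learning rate below are all assumptions chosen for illustration; this is not a real network loss, and the "pre-trained" start is simply picked by hand to stand in for a good initialization.

```python
def f(x):
    """Toy error surface with two valleys (an assumed stand-in for a loss)."""
    return x**4 - 3 * x**2 + x

def f_prime(x):
    """Derivative of f, used as the gradient."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

x_random = gradient_descent(2.0)       # arbitrary starting point
x_pretrained = gradient_descent(-2.0)  # start assumed to come from "pre-training"

print(f"arbitrary start  -> x={x_random:.3f}, error={f(x_random):.3f}")
print(f"pretrained start -> x={x_pretrained:.3f}, error={f(x_pretrained):.3f}")
```

Both runs use exactly the same update rule and both converge, but they settle in different valleys, and one valley is lower than the other. The optimizer did not change between the runs; only the starting point did.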
Here's the source for the second image.
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. An RBM models the probability of seeing a specific type of data, and training it gets the weights initialized in the right ballpark. It is considered an (unsupervised) probabilistic model, so you don't correct it using the known labels. Basically, the idea here is that a learning rate that is too big will never lead to convergence, while one that is too small will take forever to train. By "pretraining" in this way you find the ballpark of the weights, and you can then set the learning rate to be small in order to bring them to their optimal values.
As for the second question: no, you don't generally pre-learn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different from what is meant in your first question. That is, they are taking a pretrained model (say, from a model zoo) and fine-tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. I've found that convnets train faster and more efficiently, but not all data has meaning when convolved, in which case DBNs may be the way to go. If you only have a small amount of data, though, I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.
A simple example: Given an input sequence, I want the neural network to output the median of the sequence. The problem is, if a neural network learnt to compute the median of n inputs, how can it compute the median of even more inputs? I know that recurrent neural networks can learn functions like max and parity over a sequence, but computing these functions only requires constant memory. What if the memory requirement grows with the input size like computing the median?
This is a follow up question on How are neural networks used when the number of inputs could be variable?.
One idea I had is the following: treat each weight as a function of the number of inputs instead of a fixed value. A weight would then have several parameters that define a function, and we train these parameters. For example, if we want the neural network to compute the average of n inputs, we would like each weight function to behave like 1/n. Again, the average per se can be computed using recurrent neural networks or a hidden Markov model, but I was hoping this kind of approach can be generalized to solve certain problems where the memory requirement grows.
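The averaging case of this idea can be sketched in a few lines. The weight function `weight(n, c) = c / n` and the single parameter `c` are hypothetical names introduced here; the sketch only shows that one such parameter serves every input length, not how it would be trained.

```python
def weight(n, c=1.0):
    """Hypothetical weight function w(n) = c / n; c would be the trained parameter."""
    return c / n

def neuron_output(inputs, c=1.0):
    """Linear neuron whose shared weight depends on the number of inputs."""
    n = len(inputs)
    return sum(weight(n, c) * x for x in inputs)

print(neuron_output([1.0, 2.0, 3.0]))
print(neuron_output([2.0, 4.0, 6.0, 8.0]))
```

With `c = 1` the neuron returns the mean for any n, so nothing has to be retrained when the sequence gets longer. Whether a comparable weight function exists for the median is exactly the open part of the question.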
If a neural network learnt to compute the median of n inputs, how can it compute the median of even more inputs?
First of all, you should understand what a neural network is for. We generally use neural networks for problems where a direct mathematical solution is not available. For this problem, using a NN is inadvisable.
There are other problems of this nature, like forecasting, in which data arrives continuously over time.
One solution to such problems can be a Hidden Markov Model (HMM). But again, such models depend on the correlation between inputs over a period of time, so this model is not effective for problems where the input is completely random.
So, if the input is completely random and the memory requirement grows, there is not much you can do about it. One possible solution could be to grow your memory size.
Just remember one thing: NNs and similar machine learning models aim to extract meaningful information from the data. If the data is just random values, then all models will generate some random output.
One more idea: some data transformation. Let N be big enough that it is always at least as large as n. We make a net with 2*N inputs. The first N inputs are for data; if n is less than N, the remaining inputs are set to 0. The last N inputs specify which of the numbers are real data: 1 means data, 0 means not data. In Matlab notation: if v is the input, a vector of length 2*N, then we put our original data into v(1:n) and zeros into v(n+1:N). Then we put ones into v(N+1:N+n) and zeros into v(N+n+1:2*N). It is just an idea, which I have not checked. If you are interested in the application of neural networks, take a look at the example of how we have chosen an appropriate machine learning algorithm to classify EEG signals for BCI.
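The padding-and-mask encoding described above can be sketched as follows. The function name `encode` and the error raised for over-long sequences are assumptions added for this example; note that Python indexing is 0-based, unlike the Matlab notation.

```python
def encode(data, N):
    """Build the 2*N input vector: N padded data slots followed by N mask slots."""
    n = len(data)
    if n > N:
        raise ValueError("sequence is longer than the fixed capacity N")
    values = list(data) + [0.0] * (N - n)   # v(1:n) = data, v(n+1:N) = 0
    mask = [1.0] * n + [0.0] * (N - n)      # v(N+1:N+n) = 1, v(N+n+1:2*N) = 0
    return values + mask

print(encode([3.0, 1.0], N=4))
# → [3.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
```

The mask matters because a padded zero is otherwise indistinguishable from a genuine input of 0, which would change the median.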
So after you have a machine learning algorithm trained, with your layers, nodes, and weights, how exactly does it go about getting a prediction for an input vector? I am using MultiLayer Perceptron (neural networks).
From what I currently understand, you start with the input vector to be predicted. Then you send it to your hidden layer(s), where each node adds its bias term, then adds the sum of the products of each data point and that node's weight for it (found in training), then runs the result through the same activation function used in training. This is repeated for each hidden layer, and then the same is done for the output layer. The values of the nodes in the output layer are your prediction(s).
Is this correct?
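For reference, the forward pass of a small MLP can be sketched like this. The layer layout (a list of `(weights, biases)` pairs), the tanh activation, and the example numbers are all assumptions for illustration; this is not how any particular library stores its model.

```python
import math

def layer_forward(inputs, weights, biases, activation):
    """Compute one layer: weights[j][i] is the weight from input i to node j."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        # Weighted sum of the inputs, with the bias added once per node
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def predict(x, layers, activation=math.tanh):
    """Feed x through every layer in order; the last layer's outputs are the prediction."""
    a = x
    for weights, biases in layers:
        a = layer_forward(a, weights, biases, activation)
    return a

# Hypothetical trained parameters: 2 inputs -> 2 hidden nodes -> 1 output node
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.1]),
    ([[1.0, -1.0]], [0.0]),
]
print(predict([1.0, 2.0], layers))
```

The bias is added once to each node's weighted sum rather than to each individual data point, and the output of one layer becomes the input of the next.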
I got confused when using opencv to do this, because in the guide it says when you use the function predict:
If you are using the default cvANN_MLP::SIGMOID_SYM activation
function with the default parameter values fparam1=0 and fparam2=0
then the function used is y = 1.7159*tanh(2/3 * x), so the output
will range from [-1.7159, 1.7159], instead of [0,1].
However, when training it is also stated in the documentation that SIGMOID_SYM uses the activation function:
f(x)= beta*(1-e^{-alpha x})/(1+e^{-alpha x} )
Where alpha and beta are user defined variables.
So, I'm not quite sure what this means. Where does the tanh function come into play? Can anyone clear this up please? Thanks for the time!
The documentation where this is found is here:
reference to the tanh is under function descriptions predict.
reference to activation function is by the S looking graph in the top part of the page.
Since this is a general question, and not code specific, I did not post any code with it.
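The two formulas quoted above belong to the same family, because (1 - e^{-alpha x}) / (1 + e^{-alpha x}) = tanh(alpha x / 2) for any alpha. The sketch below checks this numerically; the function names and test values are assumptions, and it says nothing about which alpha and beta the library actually uses internally.

```python
import math

def sigmoid_sym(x, alpha, beta):
    """The exponential form: beta * (1 - e^(-alpha x)) / (1 + e^(-alpha x))."""
    e = math.exp(-alpha * x)
    return beta * (1.0 - e) / (1.0 + e)

def scaled_tanh(x, alpha, beta):
    """The equivalent tanh form: beta * tanh(alpha * x / 2)."""
    return beta * math.tanh(alpha * x / 2.0)

for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    a = sigmoid_sym(x, alpha=4.0 / 3.0, beta=1.7159)
    b = scaled_tanh(x, alpha=4.0 / 3.0, beta=1.7159)
    print(f"x={x:+.1f}  exponential form={a:.6f}  tanh form={b:.6f}")
```

Under this identity, y = 1.7159*tanh(2/3 * x) corresponds to beta = 1.7159 and alpha = 4/3 in the exponential form. In both forms the output is bounded by -beta and +beta, which is where the [-1.7159, 1.7159] range comes from.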
I would suggest that you read about the algorithm that you are using or plan to use. To be honest, there is no single definitive algorithm for solving a problem, but you can explore which features you have and what you need.
How an algorithm performs prediction depends entirely on the choice of algorithm. A Support Vector Machine (SVM) learns by fitting hyperplanes in the feature space using some metric such as distance, and then the learnt model is used for prediction. KNN, on the other hand, uses a simple nearest-neighbor measurement for prediction.
Please do more work on what exactly you need and read through the research papers to get a proper understanding. There is no magic involved in prediction, only mathematical formulations.
Sorry that I only keep asking here. I will study hard to get ready to answer questions too!
Many papers and articles claim that there is no restriction on the choice of activation function for an MLP.
It seems that it only matters which one fits best for the given conditions.
The articles also say that it is mathematically proven that a simple perceptron cannot solve the XOR problem.
I know that the simple perceptron model used a step function as its activation function.
But if it basically doesn't matter which activation function is used, then using
f(x)=1 if |x-a|<b
f(x)=0 if |x-a|>b
as an activation function works on the XOR problem (for a perceptron model with 2 inputs, 1 output, and no hidden layer).
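The claim can be checked directly. In the sketch below, the weights w1 = w2 = 1 and the values a = 1, b = 0.5 are hand-picked assumptions, not learned values.

```python
def band_activation(x, a, b):
    """f(x) = 1 if |x - a| < b, else 0."""
    return 1 if abs(x - a) < b else 0

def unit(x1, x2, w1=1.0, w2=1.0, a=1.0, b=0.5):
    """Single unit with no hidden layer: weighted sum followed by the band activation."""
    return band_activation(w1 * x1 + w2 * x2, a, b)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", unit(x1, x2))
```

The unit outputs 1 only when the weighted sum falls inside the band (a - b, a + b), which amounts to applying two thresholds instead of one. The classical impossibility result concerns a unit with a single threshold, so this example does not contradict it.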
I know that using artificial functions is not good for a learning model. But if it works anyway, then why do the articles say it is proven that it doesn't work?
Do the articles mean, by "simple perceptron", a model that uses a step function? Or does the activation function of a simple perceptron have to be a step function, unlike an MLP? Or am I wrong?
In general,
The problem is that non-differentiable activation functions (like the one you proposed) cannot be used for back-propagation and other gradient-based techniques. Back-propagation is a convenient way to estimate the correct threshold values (a and b in your example). Many of the popular activation functions are selected such that they approximate step behaviour while remaining differentiable.
As bgbg mentioned, your activation is non-differentiable. If you use a differentiable activation function such as the sigmoid, which MLPs require in order to compute the gradients and update the weights, then a perceptron with no hidden layer still has a linear decision boundary. It is simply fitting a line, which intuitively cannot solve the nonlinear XOR problem.