What is actually the purpose of the sigmoid derivative? - machine-learning

Honestly, I'm learning about neural networks, but I have a question about the activation part.
I know the question is general and there are a lot of explanations around the internet, but I still don't understand it clearly.
Why do we need to take the derivative of the sigmoid function? Why don't we just use it?
It would be great if you could give a clear explanation. Thank you.
I've seen many videos on YouTube and read many articles about it, but I still don't get it.
Thanks for your help.

Your question is not entirely clear, but I assume you are asking: "Why don't we just use the sigmoid function without having to calculate its derivative?"
Your question is also very broad, so my answer is broad and wordy; you will need to read more to understand all the details, for which I'll try to provide links.
Activation function: as the name suggests, we want to know whether a given node is "on" or "off", and the sigmoid function provides an easy way to squash a continuous variable (x) into the range (0, 1).
Use cases vary, and the sigmoid has particular properties, which is why there are many alternative activation functions, like tanh, ReLU, etc. Read more here: https://en.wikipedia.org/wiki/Sigmoid_function
Differentiate (take the derivative): for most models we want to find the best-fit weight parameters for all our activation functions. To do this, we typically minimise a "cost" function that describes how well our model predicts the observed data. One way to solve this optimisation problem is gradient descent: each step of gradient descent updates the parameters by following the slope of the multi-dimensional cost surface, and computing that slope requires the gradient of the activation function. This matters for back-propagation, which uses gradient descent to optimise the network and therefore requires that the activation functions you use are (in most cases) differentiable.
Read more here: https://en.wikipedia.org/wiki/Gradient_descent
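Here is a minimal sketch (toy numbers, a single sigmoid neuron, squared error) of where the derivative actually gets used; the data and learning rate are made up for illustration:

import numpy as np

def sigmoid(x):
    # Squash any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # Handy identity: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Toy data: one input feature, binary targets (invented for the example)
X = np.array([0.5, 1.5, -1.0, 2.0])
y = np.array([1.0, 1.0, 0.0, 1.0])

w, b, lr = 0.0, 0.0, 0.1  # weight, bias, learning rate

for step in range(1000):
    z = w * X + b     # weighted input
    p = sigmoid(z)    # prediction in (0, 1)
    error = p - y     # slope of the squared error w.r.t. p (up to a constant)
    # Chain rule: dCost/dw = dCost/dp * dp/dz * dz/dw
    grad_w = np.mean(error * sigmoid_derivative(z) * X)
    grad_b = np.mean(error * sigmoid_derivative(z))
    w -= lr * grad_w  # the gradient-descent step
    b -= lr * grad_b

Note that the forward pass only ever calls sigmoid; it is the weight update that needs sigmoid_derivative. That is the whole reason the derivative shows up.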
I suggest that if you have a deeper question, you take it to one of the machine learning Stack Exchange sites.

Related

Why shouldn't we use multiple activation functions in the same layer?

I'm a newbie when it comes to ML and neural nets; I have been studying mainly online through Coursera videos and a bit of Kaggle/GitHub. All the examples or cases where I've seen neural networks being applied have one thing in common: they use one specific type of activation function in all the nodes of a given layer.
From what I understand, each node uses a non-linear activation function to learn about a particular pattern in the data. If that were so, why not use multiple types of activation functions?
I did find one link, which basically says that it's easier to manage a network if we use just one activation function per layer. Are there any other benefits?
Purpose of an activation function is to introduce non-linearity to a neural network. See this answer for more insight on why our deep neural networks would not actually be deep without non-linearity.
Activation functions do their job by controlling the outputs of the neurons. Sometimes they apply a simple threshold, as ReLU does, which can be coded as follows:
def relu(x):
    # Pass positive inputs through unchanged, clip negatives to zero
    if x > 0:
        return x
    else:
        return 0
And some other times they behave in more complicated ways such as tanh(x) or sigmoid(x). See this answer for more on different sorts of activations.
I would also like to add that I agree with @Joe: an activation function does not learn a particular pattern; it affects the way that a neural network learns multiple patterns. Each activation function has its own kind of effect on the output.
Thus, one benefit of not mixing activation functions in a single layer is the predictability of their effect. We know what ReLU or sigmoid does to the output of a convolutional filter, for example. But do we know the effect of their cascaded use? In which order, by the way: does ReLU come first, or is it better to use sigmoid first? Does it matter?
If we want to benefit from a combination of activation functions, all of these questions (and maybe many more) need to be answered with scientific evidence. Tedious experiments and evaluations would have to be done to get meaningful results. Only then would we know what it means to use them together, and after that maybe a new type of activation function will arise and there will be a new name for it.
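As a small illustration of the ordering question (a toy check, not a study): even for a single scalar input, the two cascade orders disagree, so their combined effect really is a new, unstudied function:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

x = -2.0
print(relu(sigmoid(x)))  # sigmoid first: ~0.119 (sigmoid output is positive, so ReLU passes it)
print(sigmoid(relu(x)))  # ReLU first: exactly 0.5 (negative input is clipped to 0 before sigmoid)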

Supervised learning linear regression

I am confused about how linear regression works in supervised learning. I want to build an evaluation function for a board game using linear regression, so I need both input data and output data. The input data is my board state, and I need the corresponding value for that state, right? But how can I get this expected value? Do I need to write an evaluation function myself first? But I thought the point was to generate an evaluation function by using linear regression, so I'm a little confused about this.
It's supervised learning after all, meaning you will need input and output.
Now the question is: how do you obtain these? And this is not trivial!
Candidates are:
historical data (e.g. online-play history)
some form of self-play / reinforcement learning (more complex)
But then a new problem arises: what kind of output is available, and what kind of input will you use?
If there were some a-priori implemented AI, you could just take its scores. But with historical data, for example, you only get -1, 0, 1 (A wins, draw, B wins), which makes learning harder (and this touches the credit-assignment problem: there might be one move which made someone lose, and it's hard to tell which of 30 moves led to the result of 1). This is also related to the input. Take chess, for example, and take a random position from some online game: there is a chance that this position is unique across 10 million games (or at least does not occur often), which conflicts with the expected performance of your approach. I assumed here that the input is the full board position. This changes for other inputs, e.g. chess material, where the input is just a histogram of pieces (3 of these, 2 of those). Now there are far fewer unique inputs and learning will be easier.
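To make the material-histogram idea concrete, here is a minimal sketch (all piece counts, outcomes and the feature choice are invented for illustration) of fitting a linear evaluation function with least squares:

import numpy as np

# Hypothetical training rows: material differences (mine minus opponent's)
# for (pawns, knights, bishops, rooks, queens); labels are outcomes in {-1, 0, 1}.
X = np.array([
    [1, 0, 0, 0, 0],    # up a pawn      -> won
    [0, -1, 0, 0, 0],   # down a knight  -> lost
    [0, 0, 0, 1, 0],    # up a rook      -> won
    [0, 0, 0, 0, -1],   # down a queen   -> lost
    [0, 0, 0, 0, 0],    # equal material -> draw
], dtype=float)
y = np.array([1.0, -1.0, 1.0, -1.0, 0.0])

# Ordinary least squares: the learned weights act like piece values
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def evaluate(material_diff):
    # The linear evaluation function: a weighted sum of material differences
    return float(np.dot(w, material_diff))

print(evaluate(np.array([2.0, 0.0, 0.0, 0.0, 0.0])))  # e.g. two pawns up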
Long story short: it's a complex task with a lot of different approaches, and most of this is somewhat bound to your exact task! A linear evaluation function is not uncommon in reinforcement-learning approaches. You might want to read some literature on these (this function is a core component: e.g. table lookup vs. linear regression vs. neural network to approximate the value or policy function).
I might add that your task points to the self-learning approach to AI, which is very hard, and it's a topic which has gained additional popularity in recent years (there was success before: see the Backgammon AI). But all of these approaches are highly complex, and a good understanding of RL and mathematical basics like Markov decision processes is important there.
For more classic, hand-made evaluation-function-based AIs, a lot of people have used an additional regressor for tuning / weighting already-implemented components. There is some overview at the chessprogramming wiki. (The chess-material example from above might be a good one: the assumption is that more pieces are better than fewer, but it's hard to assign them values.)

simple perceptron model and XOR

Sorry that I keep asking questions here. I will study hard so that I can answer questions too!
Many papers and articles claim that there is no restriction on the choice of activation function for an MLP.
It seems it only matters which one fits best for the given conditions.
The articles also say that it is mathematically proven that a simple perceptron cannot solve the XOR problem.
I know that the simple perceptron model traditionally uses the step function as its activation function.
But if it basically doesn't matter which activation function we use, then using
f(x) = 1 if |x - a| < b
f(x) = 0 if |x - a| > b
as the activation function works on the XOR problem (for a 2-input, 1-output perceptron model with no hidden layer).
I know that using artificial functions like this is not good for a learning model. But if it works anyway, why do the articles say it is proven not to work?
Do the articles mean by "simple perceptron model" the one using a step function? Or does the activation function for a simple perceptron have to be a step function, unlike for an MLP? Or am I wrong?
In general, the problem is that non-differentiable activation functions (like the one you proposed) cannot be used with back-propagation and similar techniques. Back-propagation is a convenient way to estimate the correct threshold values (a and b in your example). All the popular activation functions are chosen so that they approximate step behaviour while remaining differentiable.
As bgbg mentioned, your activation is non-differentiable. If you use a differentiable activation function, which is what MLPs require in order to compute the gradients and update the weights, then the perceptron simply fits a line, which intuitively cannot solve the non-linear XOR problem.
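For completeness, a minimal sketch showing both halves of this thread: with hand-picked (not learned) parameters, the proposed window activation does reproduce XOR, but because it is non-differentiable there is no gradient with which back-propagation could find a and b:

def window_activation(s, a, b):
    # The proposed activation: fire only when the weighted sum is within b of a
    return 1 if abs(s - a) < b else 0

# Hand-picked parameters: w1 = w2 = 1, a = 1, b = 0.5 (chosen by inspection, not learned)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    s = 1 * x1 + 1 * x2
    print(x1, x2, "->", window_activation(s, a=1, b=0.5))  # matches XOR on all four inputs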

Things to try when Neural Network not Converging

One of the most popular questions regarding neural networks seems to be:
Help!! My Neural Network is not converging!!
See here, here, here, here and here.
So, after eliminating any errors in the implementation of the network, what are the most common things one should try?
I know that the things to try vary widely depending on the network architecture.
But by tweaking which parameters (learning rate, momentum, initial weights, etc.) and implementing which new features (windowed momentum?) were you able to overcome similar problems while building your own neural net?
Please give answers which are language-agnostic if possible. This question is intended to give some pointers to people stuck with neural nets that are not converging.
If you are using ReLU activations, you may have a "dying ReLU" problem. In short, under certain conditions, any neuron with a ReLU activation can be subject to a (bias) adjustment that leads to it never being activated again. It can be fixed with a "Leaky ReLU" activation, which is well explained in that article.
For example, I built a simple MLP (3-layer) network with ReLU output which failed. I gave it data it could not possibly fail on, and it still failed. I turned the learning rate way down, and it failed more slowly. It always converged to predicting each class with equal probability. It was all fixed by using a Leaky ReLU instead of the standard ReLU.
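For reference, a minimal sketch of the fix (the negative slope 0.01 is a common default, not something prescribed here):

def relu(x):
    # Standard ReLU: the gradient is exactly 0 for x < 0, so a neuron stuck there never recovers
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: a small slope for x < 0 keeps a nonzero gradient flowing
    return x if x > 0 else alpha * x

print(relu(-3.0), leaky_relu(-3.0))  # 0.0 vs -0.03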
If we are talking about classification tasks, then you should shuffle the examples before training your net. That is, don't feed your net thousands of examples of class #1 followed by thousands of examples of class #2, etc. If you do that, your net most probably won't converge, but will tend to predict the last-trained class.
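A minimal sketch of shuffling while keeping examples and labels aligned (the array names are placeholders):

import numpy as np

X = np.arange(10).reshape(5, 2)   # placeholder examples
y = np.array([0, 0, 0, 1, 1])     # placeholder class labels

perm = np.random.permutation(len(X))  # one shared permutation
X, y = X[perm], y[perm]               # rows and labels stay paired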
I faced this problem while implementing my own backprop neural network. I tried the following:
Implemented momentum (and kept the value at 0.5)
Kept the learning rate at 0.1
Charted the error, the weights, and the input and output of each and every neuron; seeing the data as a graph is more helpful in figuring out what is going wrong
Tried out different activation functions (all sigmoidal), but this did not help me much
Initialized all weights to random values between -0.5 and 0.5 (my network's output was in the range -1 to 1)
I did not try this, but gradient checking can be helpful as well; see the sketch after this list
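A minimal sketch of the gradient-checking idea: compare the analytic gradient your backprop code produces against a central finite difference (the loss here is a stand-in for whatever your network computes):

import numpy as np

def loss(w):
    # Stand-in loss for illustration: any differentiable function of the weights
    return np.sum(w ** 2) / 2.0

def analytic_grad(w):
    # What your backprop code claims the gradient is (here: d/dw of w^2/2 = w)
    return w

w = np.array([0.3, -1.2, 0.8])
eps = 1e-5
numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    # Central difference: (f(w + eps) - f(w - eps)) / (2 * eps)
    numeric[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

# These should agree to several decimal places if the backprop gradients are right
print(np.max(np.abs(numeric - analytic_grad(w))))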
If the problem is only convergence (not the actual "well-trained network", which is far too broad a problem for SO), then the only thing that can be wrong once the code is OK is the training-method parameters. If one uses naive backpropagation, these parameters are the learning rate and momentum. Nothing else matters, because for any initialization and any architecture, a correctly implemented neural network should converge for a good choice of these two parameters (in fact, for momentum = 0 it should converge to some solution too, for a small enough learning rate).
In particular, there is a good heuristic approach called "resilient backprop" (Rprop), which is in fact a parameterless approach that should (almost) always converge (assuming a correct implementation).
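For a flavour of why Rprop is (almost) parameterless, here is a simplified sketch of its core update rule; the factors 1.2 and 0.5 are the usual values from the literature, and real implementations add a few refinements:

def rprop_step(w, grad, prev_grad, step, inc=1.2, dec=0.5,
               step_min=1e-6, step_max=50.0):
    # Only the SIGN of each gradient component is used; its magnitude is ignored,
    # which is what makes the method so insensitive to parameter choices.
    for i in range(len(w)):
        if grad[i] * prev_grad[i] > 0:
            step[i] = min(step[i] * inc, step_max)  # same direction: speed up
        elif grad[i] * prev_grad[i] < 0:
            step[i] = max(step[i] * dec, step_min)  # sign flip: we overshot, back off
        if grad[i] > 0:
            w[i] -= step[i]
        elif grad[i] < 0:
            w[i] += step[i]
    return w, step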
After you've tried different meta-parameters (optimization / architecture), the most probable place to look is THE DATA.
As for myself, to minimize fiddling with meta-parameters I keep my optimizer automated; Adam is my optimizer of choice.
There are some rules of thumb regarding application vs. architecture, but it's really best to work those out on your own.
To the point: in my experience, after you've debugged the net (the easy debugging) and it still doesn't converge, or it gets stuck in an undesired local minimum, the usual suspect is the data.
Whether you have contradictory samples or just incorrect ones (outliers), a small amount can make the difference between, say, 0.6 accuracy and (after cleaning) 0.9 accuracy.
A smaller but golden (clean) dataset is much better than a big, slightly dirty one.
With augmentation you can tweak the results even further.

How to adjust weights - backpropagation [duplicate]

This question already has answers here:
How does a back-propagation training algorithm work?
(4 answers)
Closed 3 years ago.
I decided to use a genetic algorithm to train neural networks. They will develop through inheritance, where one (of the many) variable genes will be the transfer function.
So I need to go deeper into the mathematics, and it is really time-consuming.
I have, for example, three variants of the transfer-function gene:
1) log-sigmoid function
2) tan-sigmoid function
3) Gaussian function
One of the features of the transfer-function gene should be that it can modify the parameters of the function to get a different shape.
And now, the problem that I am not yet capable of solving:
I have the error at the output of the neural network; how do I transfer it back onto the weights through different functions with different parameters? According to my research, I think it has something to do with derivatives and gradient descent.
I am a high-level math noob. Can someone explain to me, with a simple example, how to propagate the error back onto the weights through a parametrized (for example) sigmoid function?
EDIT
I am still doing research, and now I am not sure whether I have misunderstood backpropagation. I found this document:
http://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
and they have an example of computing the weights in which they do NOT involve the transfer function in the weight adjustment.
So is it not necessary to involve the transfer function in weight adjustment?
Backpropagation does indeed have something to do with derivatives and gradient descent.
I don't think there is any shortcut to truly understanding the math, but this may help: I wrote the answer below for someone else with basically the same question, and it should at least explain at a high level what's going on, and why.
How does a back-propagation training algorithm work?
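To address your specific question: yes, the transfer function does enter the weight adjustment, through its derivative. A minimal sketch for one output neuron with a parameterized log-sigmoid (all numbers invented; k stands in for the "shape" parameter your transfer-function gene would carry):

import math

def sigmoid(x, k=1.0):
    # Parameterized log-sigmoid: k controls the steepness (the "shape" of the function)
    return 1.0 / (1.0 + math.exp(-k * x))

def sigmoid_derivative(x, k=1.0):
    s = sigmoid(x, k)
    return k * s * (1.0 - s)  # derivative of the parameterized form

# One output neuron, two inputs, one training example (values made up)
inputs = [0.5, -0.3]
weights = [0.1, 0.4]
target = 1.0
k = 2.0    # shape parameter carried by the gene
lr = 0.5   # learning rate

z = sum(wi * xi for wi, xi in zip(weights, inputs))  # weighted sum
out = sigmoid(z, k)

# Output-layer delta: the error times the transfer function's derivative.
# This is exactly where the transfer function enters the weight update.
delta = (out - target) * sigmoid_derivative(z, k)

weights = [wi - lr * delta * xi for wi, xi in zip(weights, inputs)]
print(weights)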
