simple perceptron model and XOR - machine-learning

Sorry that I only keep asking here. I will study hard so that I can answer questions too!
Many papers and articles claim that there is no restriction on the choice of activation function for an MLP.
It seems to be only a matter of which one fits the given conditions best.
The articles also say that it is mathematically proven that a simple perceptron cannot solve the XOR problem.
I know that the simple perceptron model traditionally uses a step function as its activation function.
But if it basically doesn't matter which activation function is used, then using
f(x)=1 if |x-a|<b
f(x)=0 if |x-a|>b
as the activation function works on the XOR problem (for a perceptron with 2 inputs, 1 output, and no hidden layer).
I know that using such artificial functions is not good for a learning model. But if it works anyway, why do the articles say it is proven that it doesn't work?
Do the articles mean, by "simple perceptron", a model that uses a step function? Or does the activation function of a simple perceptron have to be a step function, unlike in an MLP? Or am I wrong?

In general, the problem is that non-differentiable activation functions (like the one you proposed) cannot be used with back-propagation and other gradient-based techniques. Back-propagation is a convenient way to estimate the correct threshold values (a and b in your example). Popular activation functions such as sigmoid and tanh are chosen because they approximate step behaviour while remaining differentiable.

As bgbg mentioned, your activation is non-differentiable. If you use one of the usual differentiable (and monotonic) activation functions, which MLPs need in order to compute the gradients and update the weights, then the perceptron's decision boundary is simply a line, which intuitively cannot solve the non-linear XOR problem.
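To make the asker's observation concrete, here is a minimal sketch of a single unit with the proposed band activation. The weights w1 = w2 = 1 and the parameters a = 1, b = 0.5 are hand-picked assumptions for illustration, not values from the question, and nothing here is learned.

```python
def band_activation(x, a, b):
    """Return 1 if |x - a| < b, else 0 (the activation proposed in the question)."""
    return 1 if abs(x - a) < b else 0


def single_unit(x1, x2, w1=1.0, w2=1.0, a=1.0, b=0.5):
    # The net input is a plain weighted sum; there is no hidden layer.
    net = w1 * x1 + w2 * x2
    return band_activation(net, a, b)


xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [single_unit(x1, x2) for x1, x2 in xor_inputs]
print(outputs)
```

So the unit does reproduce XOR, but only because the activation is non-monotonic: it fires inside a band rather than on one side of a threshold. The impossibility result applies to threshold-style (monotonic) units, and the hand-set a and b here could not be found by back-propagation because the function has no useful gradient.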

Related

Why shouldn't we use multiple activation functions in the same layer?

I'm a newbie when it comes to ML and neural nets, and have been studying mainly online through Coursera videos and a bit of Kaggle/GitHub. All the examples or cases where I've seen neural networks being applied have one thing in common: they use one specific type of activation function in all the nodes of a given layer.
From what I understand, each node uses a non-linear activation function to learn about a particular pattern in the data. If that is so, why not use multiple types of activation functions?
I did find one link, which basically says that it's easier to manage a network if we use just one activation function per layer. Are there any other benefits?
The purpose of an activation function is to introduce non-linearity into a neural network. See this answer for more insight on why our deep neural networks would not actually be deep without non-linearity.
Activation functions do their job by controlling the outputs of the neurons. Sometimes they provide a simple threshold, as ReLU does, which can be coded as follows:
def relu(x):
    # Pass positive inputs through unchanged; clamp everything else to 0.
    if x > 0:
        return x
    else:
        return 0
And some other times they behave in more complicated ways such as tanh(x) or sigmoid(x). See this answer for more on different sorts of activations.
I would also like to add that I agree with @Joe: an activation function does not learn a particular pattern; it affects the way that a neural network learns multiple patterns. Each activation function has its own kind of effect on the output.
Thus, one benefit of not using multiple activation functions in a single layer is the predictability of their effect. We know what ReLU or sigmoid does to the output of a convolutional filter, for example. But do we know the effect of their cascaded use? And in which order: does ReLU come first, or is it better to use sigmoid first? Does it matter?
If we want to benefit from a combination of activation functions, all of these questions (and maybe many more) need to be answered with scientific evidence. Tedious experiments and evaluations would have to be done to get meaningful results. Only then would we know what it means to use them together, and after that maybe a new type of activation function will arise and get a name of its own.
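Mechanically, nothing prevents mixing activations inside one layer. The sketch below is only an illustration of what that would look like; the `mixed_layer` helper and the choice of one function per unit are hypothetical, and it says nothing about whether such a layer trains well.

```python
import math


def relu(x):
    return x if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical layer: each unit gets its own activation function.
unit_activations = [relu, math.tanh, sigmoid]


def mixed_layer(net_inputs):
    # Apply the i-th activation to the i-th unit's net input.
    return [f(z) for f, z in zip(unit_activations, net_inputs)]


layer_out = mixed_layer([-1.0, 0.0, 0.0])
print(layer_out)
```

Note that the three outputs live on different scales (unbounded above, (-1, 1), and (0, 1)), which is one concrete form of the predictability concern raised above.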

Can gradient descent itself solve non-linear problem in ANN?

I've recently been studying the theory of neural networks, and I'm a little confused about the roles of gradient descent and the activation function in an ANN.
From what I understand, the activation function is used to turn the model into a non-linear model, so that it can solve problems that are not linearly separable. And gradient descent is the tool that helps the model learn.
So my questions are:
If I use an activation function such as sigmoid for the model, but instead of using gradient descent to improve the model I use the classic perceptron learning rule: Wj = Wj + a*(y-h(x)), where h(x) is the sigmoid function applied to the net input. Can the model learn a non-linearly separable problem?
If I do not include a non-linear activation function in the model, and use just the simple net input: h(x) = w0 + w1*x1 + ... + wj*xj, with gradient descent to improve the model. Can the model learn a non-linearly separable problem?
I'm really confused about which of the two is the main reason the model can learn a non-linearly separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you use determines a family of functions h_w(x), and each value of w represents one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) is often infinitely large.
The goal of learning will then be to determine which w is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick reasonably good w, and thus select a particular model h(x).
The problem is, the actual f function you are trying to approximate may not be part of the family h_w you decided to pick, and so you are .
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single-layer ANN will not help it learn a non-linear function. Indeed, a single-layer ANN is equivalent to linear regression, and adding the sigmoid turns it into logistic regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts as a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break e.g. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the TensorFlow Playground. Note that the learning rule does not matter here, because the family of functions you're optimising over consists only of linear functions of their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with a simple linear regression. You don't even project your result back to [0,1], so the output will not be simple to interpret as class probability, but the final result will be the same. See the Playground for a visual proof.
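Both answers above can be checked with a small experiment. This is a rough sketch, not a proof: it brute-forces a coarse grid of weights (the grid range and step are arbitrary assumptions) for a single unit on XOR, once with a sigmoid thresholded at 0.5 and once with the raw net input thresholded at 0.

```python
import itertools
import math

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w1, w2, b, x1, x2, use_sigmoid):
    z = w1 * x1 + w2 * x2 + b
    # sigmoid(z) > 0.5 exactly when z > 0, so both rules pick the same class.
    return int(sigmoid(z) > 0.5) if use_sigmoid else int(z > 0)


def best_score(use_sigmoid):
    grid = [i * 0.5 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    best = 0
    for w1, w2, b in itertools.product(grid, repeat=3):
        correct = sum(
            predict(w1, w2, b, x1, x2, use_sigmoid) == y for (x1, x2), y in XOR
        )
        best = max(best, correct)
    return best


best_with_sigmoid = best_score(True)
best_without = best_score(False)
print(best_with_sigmoid, best_without)
```

Neither variant gets more than 3 of the 4 XOR points right, whichever weights are tried, which is what a linear decision boundary predicts. How the weights are found (perceptron rule or gradient descent) does not enter into it.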
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that a single hidden layer is enough to approximate any continuous function on a bounded input range to arbitrary accuracy, as long as you make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that, for many functions, a deeper network requires fewer neurons overall, even though it makes training more complex.
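The two options above can be sketched on XOR instead of the Playground's circle dataset (a substitution made here to keep the example tiny). All weights below are hand-set assumptions chosen for illustration; in practice option 2 would learn them with gradient descent.

```python
def relu(z):
    return max(0.0, z)


def xor_with_feature(x1, x2):
    # Option 1: add the hand-crafted input x1*x2; the model is still linear
    # in its (augmented) inputs x1, x2 and x1*x2.
    z = 1.0 * x1 + 1.0 * x2 - 2.0 * (x1 * x2) - 0.5
    return int(z > 0)


def xor_with_hidden_layer(x1, x2):
    # Option 2: two hidden ReLU units followed by a linear output unit.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    z = h1 - 2.0 * h2 - 0.5
    return int(z > 0)


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
feature_out = [xor_with_feature(x1, x2) for x1, x2 in points]
hidden_out = [xor_with_hidden_layer(x1, x2) for x1, x2 in points]
print(feature_out, hidden_out)
```

In both cases the final decision is still a linear threshold; what changed is the space it is applied in, either through a chosen feature or through the hidden units.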
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which linear assumptions are relatively accurate.
Non-linear transformations give us a better separation to implement the ideas "this is boring" and "this is exactly what I'm looking for!" If these functions are smooth, or have a very small quantity of jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.

What do non-linear activation functions do at a fundamental level in neural networks?

I've been trying to find out what exactly non-linear activation functions do when implemented in a neural network.
I know they modify the output of a neuron, but how and for what purpose?
I know they add non-linearity to otherwise linear neural networks, but for what purpose?
What exactly do they do to the output of each layer? Is it some form of classification?
I want to know what exactly their purpose is within neural networks.
Wikipedia says that "the activation function of a node defines the output of that node given an input or set of inputs." This article states that the activation function checks whether a neuron has "fired" or not. I've looked at a bunch more articles and other questions on Stack Overflow as well, but none of them gave a satisfying answer as to what is occurring.
The main reason for using non-linear activation functions is to be able to learn non-linear target functions, i.e. to learn a non-linear relationship between the inputs and outputs. If a network consists of only linear activation functions, it can only model a linear relationship between the inputs and outputs, which is not enough for almost all applications.
I am by no means an ML expert, so maybe this video can explain it better: https://www.coursera.org/lecture/neural-networks-deep-learning/why-do-you-need-non-linear-activation-functions-OASKH
Hope this helps!
First of all, it's better to have a clear idea of why we use activation functions.
We use activation functions to propagate the output of one layer's nodes to the next layer. Activation functions are scalar-to-scalar functions, and we use them for the hidden neurons in a neural network to introduce non-linearity into the network's model. So, at a simpler level, activation functions are used to introduce non-linearity into the network.
So what is the use of introducing non-linearity? First, non-linearity means that an output cannot be reproduced from a linear combination of the inputs. Therefore, without a non-linear activation function, a neural network would still behave like a single-layer perceptron even if it had hundreds of hidden layers. The reason is that however you stack and sum linear layers, the result is still a linear function of the input.
Anyhow, for a deeper understanding, I suggest you look at this Medium post as well as this video by Andrew Ng himself.
Let me restate some important parts of Andrew Ng's video below.
...if you don't have an activation function, then no matter how many
layers your neural network has, all it's doing is just computing a
linear activation function. So you might as well not have any hidden
layers.
...it turns out that if you have a linear activation function here and
a sigmoid function here, then this model is no more expressive than
standard logistic regression without any hidden layer.
...so unless
you throw a non-linear in there, then you're not computing more
interesting functions even as you go deeper in the network.
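The collapse described in the quotes can be checked numerically. A minimal sketch with made-up weights (all numbers below are arbitrary assumptions): two stacked layers with no activation, W2(W1 x + b1) + b2, give the same output as the single layer (W2 W1) x + (W2 b1 + b2).

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


# Arbitrary example weights: 2 inputs -> 3 hidden units -> 1 output.
W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, -0.5, 1.0]
W2, b2 = [[1.0, -1.0, 2.0]], [0.25]


def two_linear_layers(x):
    # No activation between the layers.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)


# Fold both layers into one equivalent linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)


def one_linear_layer(x):
    return add(matvec(W, x), b)


x = [1.0, 2.0]
print(two_linear_layers(x), one_linear_layer(x))
```

Putting any non-linear function between the two layers breaks this folding step, which is why the hidden layer only adds expressive power once an activation is present.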

Machine Learning, After training, how exactly does it get a prediction? opencv

So after you have a machine learning algorithm trained, with your layers, nodes, and weights, how exactly does it go about getting a prediction for an input vector? I am using MultiLayer Perceptron (neural networks).
From what I currently understand, you start with your input vector to be predicted. Then you send it to your hidden layer(s) where it adds your bias term to each data point, then adds the sum of the product of each data point and the weight for each node (found in training), then runs that through the same activation function used in training. Repeat for each hidden layer, then does the same for your output layer. Then each node in the output layer is your prediction(s).
Is this correct?
I got confused when using opencv to do this, because in the guide it says when you use the function predict:
If you are using the default cvANN_MLP::SIGMOID_SYM activation
function with the default parameter values fparam1=0 and fparam2=0
then the function used is y = 1.7159*tanh(2/3 * x), so the output
will range from [-1.7159, 1.7159], instead of [0,1].
However, when training it is also stated in the documentation that SIGMOID_SYM uses the activation function:
f(x)= beta*(1-e^{-alpha x})/(1+e^{-alpha x} )
Where alpha and beta are user defined variables.
So, I'm not quite sure what this means. Where does the tanh function come into play? Can anyone clear this up please? Thanks for the time!
The documentation where this is found is here:
reference to the tanh is under function descriptions predict.
reference to activation function is by the S looking graph in the top part of the page.
Since this is a general question, and not code specific, I did not post any code with it.
I would suggest that you read about the algorithm that you are using or plan to use. To be honest, there is no single definitive algorithm for solving a problem, but you can explore what features you have and what you need.
How an algorithm performs prediction depends entirely on the choice of algorithm. A Support Vector Machine (SVM) performs prediction by fitting hyperplanes in the feature space, using some metric such as distance for learning, and then the learnt model is used for prediction. KNN, on the other hand, uses a simple nearest-neighbour measurement for prediction.
Please do more work on what exactly you need and read through the research papers to get a proper understanding. There is no magic involved in prediction, only mathematical formulations.
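On the specific tanh question, the two formulas quoted from the documentation are the same family of functions: beta*(1-e^{-alpha x})/(1+e^{-alpha x}) equals beta*tanh(alpha*x/2). The sketch below checks that numerically and then runs a forward pass as described in the question. The values alpha = 4/3 and beta = 1.7159 are simply the ones that make the two quoted formulas coincide (an inference from the formulas, not a statement about OpenCV internals), and the tiny network weights are invented for illustration.

```python
import math


def sigmoid_sym(x, alpha, beta):
    # Symmetric sigmoid as written in the quoted training documentation.
    return beta * (1.0 - math.exp(-alpha * x)) / (1.0 + math.exp(-alpha * x))


def scaled_tanh(x):
    # Form quoted in the predict() description.
    return 1.7159 * math.tanh(2.0 / 3.0 * x)


def activation(z):
    return sigmoid_sym(z, 4.0 / 3.0, 1.7159)


def layer(inputs, weights, biases):
    # One layer: weighted sum of the inputs plus a bias per node, then the activation.
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical trained weights: 2 inputs -> 2 hidden nodes -> 1 output node.
hidden = layer([0.5, -1.0], [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1])
prediction = layer(hidden, [[0.7, -0.5]], [0.05])
print(prediction)
```

Because the output node also goes through this activation, predictions fall in (-1.7159, 1.7159) rather than [0, 1], which matches the range mentioned in the quoted note.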

How to adjust weights - backpropagation [duplicate]

I decided to make a genetic algorithm to train neural networks. They will develop through inheritance, where one (of the many) variable genes should be the transfer function.
So I need to go deeper into the mathematics, and it is really time-consuming.
I have, for example, three variants of the transfer function gene:
1) log-sigmoid function
2) tan-sigmoid function
3) Gaussian function
One of the features of the transfer function gene should be that it can modify the parameters of the function to get a different shape of function.
And now, the problem that I am not yet capable of solving:
I have an error at the output of the neural network. How do I transfer it onto the weights through different functions with different parameters? According to my research, I think it has something to do with derivatives and gradient descent.
I am a noob at high-level math. Can someone explain to me, with a simple example, how to propagate the error back onto the weights through a parametrized (for example) sigmoid function?
EDIT
I am still doing research, and now I am not sure whether I have misunderstood backpropagation. I found this doc
http://www.google.cz/url?sa=t&rct=j&q=backpropagation+algorithm+sigmoid+examples&source=web&cd=10&ved=0CHwQFjAJ&url=http%3A%2F%2Fwww4.rgu.ac.uk%2Ffiles%2Fchapter3%2520-%2520bp.pdf&ei=ZF9CT-7PIsak4gTRypiiCA&usg=AFQjCNGWZjabH5ALbDLgSOBak-BTRGmS3g
and they have an example of computing weights in which they do NOT involve the transfer function in the weight adjustment.
So is it not necessary to involve the transfer function in the weight adjustment?
Backpropagation does indeed have something to do with derivatives and gradient descent.
I don't think there is any shortcut to truly understanding the math, but this may help. I wrote it for someone else with basically the same question, and it should at least explain at a high level what's going on, and why.
How does a back-propagation training algorithm work?
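As a small concrete case, here is a sketch of one weight update for a single neuron with a parametrized sigmoid f(net) = 1/(1+e^{-k*net}), where k is the shape parameter. The squared-error loss and all numeric values are assumptions picked for illustration. The derivative of the transfer function, f'(net) = k*f(net)*(1-f(net)), appears explicitly in the weight gradient, and a finite-difference check confirms the formula.

```python
import math


def sigmoid_k(net, k):
    # Parametrized sigmoid; k controls the steepness.
    return 1.0 / (1.0 + math.exp(-k * net))


def sigmoid_k_prime(net, k):
    s = sigmoid_k(net, k)
    return k * s * (1.0 - s)


def error(w, x, target, k):
    out = sigmoid_k(w * x, k)
    return 0.5 * (out - target) ** 2


def grad_w(w, x, target, k):
    net = w * x
    out = sigmoid_k(net, k)
    # Chain rule: dE/dw = dE/dout * dout/dnet * dnet/dw
    return (out - target) * sigmoid_k_prime(net, k) * x


w, x, target, k, lr = 0.3, 1.5, 1.0, 2.0, 0.5
analytic = grad_w(w, x, target, k)

# Finite-difference check of the analytic gradient.
eps = 1e-6
numeric = (error(w + eps, x, target, k) - error(w - eps, x, target, k)) / (2 * eps)

w_new = w - lr * analytic  # one gradient descent step
print(analytic, numeric, error(w, x, target, k), error(w_new, x, target, k))
```

Swapping in a different transfer function (tanh, Gaussian) only changes `sigmoid_k` and `sigmoid_k_prime`; the chain-rule structure of the update stays the same. Examples that seem to omit the transfer function have usually folded its derivative into an error term for the node.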

Resources