I'm recently studying the theory about neural network. And I'm a little confuse about the role of gradient descent and activation function in ANN.
From what I understand, the activation function is used for transforming the model to non-linear model. So that it can solve the problem that is not linear separable. And the gradient descent is the tool to help model learn.
So my questions are :
If I use an activation function such as sigmoid for the model, but instead of using gradient decent to improve the model, I use classic perceptron learning rule : Wj = Wj + a*(y-h(x)), where the h(x) is the sigmoid function with the net input. Can the model learn the non-linear separable problem ?
If I do not include the non-linear activation function in the model. Just simple net input : h(x) = w0 + w1*x1 + ... + wj*xj. And using gradient decent to improve the model. Can the model learn the non-linear separable problem ?
I'm really confused about this problem, that which one is the main reason that the model can learn non-linear separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you will use will determine a family of function h_w(x), and each value of w will represent one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) are often infinitely large.
The goal of learning will then be to determine which w is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick reasonably good w, and thus select a particular model h(x).
The problem is, the actual f function you are trying to approximate may not be part of the family h_w you decided to pick, and so you are .
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single layer model ANN will not help it learn a non-linear function. Indeed a single layer ANN is equivalent to linear regression, and adding the sigmoid transforms it into Logistic Regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break eg. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here, because the family of function you're optimising over consist only of linear functions on their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with a simple linear regression. You don't even project your result back to [0,1], so the output will not be simple to interpret as class probability, but the final result will be the same. See the Playground for a visual proof.
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data, by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that using a single hidden layer is enough to approximate anything as long as make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper makes it require less neurons overall, even though it makes training more complex.
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which linear assumptions are relatively accurate.
Non-linear transformations give us a better separation to implement the ideas "this is boring" and "this is exactly what I'm looking for!" If these functions are smooth, or have a very small quantity of jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.
Related
We know we need 4 things for building a machine learning algorithm:
A Dataset
A Model
A cost function
An optimization procedure
Taking the example of linear regression (y = m*x +q) we have two most common way of finding the best parameters: using ML or MSE as cost functions.
We hypotize data are Gaussian-distributed, using ML.
Is this assumption part of the model, also?
It it's not, why? Is it part of the cost function?
I can't see the "edge" of the model, in this case.
Is this assumption part of the model, also?
Yes it is. The ideas of different loss functions derived from the nature of the problem, consequently the nature of the model.
MSE by definition calculates for the mean of the squares of the errors (error means the difference between real y and predicted y) which in its turn will be high if the data is not Gaussian-Like distributed. Just imagine a few extreme values among the data, what will happen to the line slope and consequently the residual error?
It is worth mentioning the assumptions of Linear Regression:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
If it's not, why? Is it part of the cost function?
As far I have seen, the assumption is not directly related to the cost function itself, rather related -as above-mentioned- to the model itself.
For example, Support Vector Machine idea is separation of classes. That’s finding out a line/ hyper-plane (in multidimensional space that separate outs classes), thus its cost function is Hinge Loss to "maximum-margin" of classification.
On the other hand, Logistic Regression uses Log-Loss (related to cross-entropy) because the model is binary and works on the probability of the output (0 or 1). And the list goes on...
The assumption that the data is Gaussian-distributed is part of the model in the sense that, for Gaussian distributed data the minimal Mean Squared Error also yields the maximum liklelihood solution for the data, given the model parameters. (Common proof, you can look it up if you are interested).
So you could say that the Gaussian distribution assumption justifies the choice of least squares as the loss function.
I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks and therefore I strongly suggest you start here ( Michael Nielsen's book ).
It is python-oriented book with graphical, textual and formulated explanations - great for beginners. I am confident that you will find this book useful for your understanding. Look for chapters 2 and 3 to address your problems.
Addressing your question about the Sigmoids, it is possible to use it for multiclass predictions, but not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)) where z is the scalar multiplication of the previous hidden layer (or inputs) and a row of the weights matrix, in addition to a bias (reminder: z=w_i . x + b where w_i is the i-th row of the weight matrix ). This activation is independent of the others rows of the matrix.
Classification tasks are regarding categories. Without any prior knowledge ,and even with, most of the times, categories have no order-value interpretation; predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding for categories usually performs better than predicting a category number using a single activation function.
To recap, we want an output layer with number of neurons equals to number of categories, and sigmoids are independent of each other, given the previous layer values. We also would like to predict the most probable category, which implies that we want the activations of the output layer to have a meaning of probability disribution. But Sigmoids are not guaranteed to sum to 1, while softmax activation does.
Using L2-loss function is also problematic due to vanishing gradients issue. Shortly, the derivative of the loss is (sigmoid(z)-y) . sigmoid'(z) (error times the derivative), that makes this quantity small, even more when the sigmoid is closed to saturation. You can choose cross entropy instead, or a log-loss.
EDIT:
Corrected phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we used today as categorical predictions for definite finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, one-hot-encoding and cross entropy is a very common practice. It is reasonable to use that if the aforementioned is correct. However, there are (many) cases it doesn't apply. For instance, when trying to balance the data. For some tasks, e.g. semantic segmentation tasks, categories can have ordering/distance between them (or their embeddings) with meaning. So please, choose wisely the tools for your applications, understanding what their doing mathematically and what their implications are.
What you ask is a very broad question.
As far as I know, when the class become 2, the softmax function will be the same as sigmoid, so yes they are related. Cross entropy maybe the best loss function.
For the backpropgation, it is not easy to find the formula...there
are many ways.Since the help of CUDA, I don't think it is necessary to spend much time on it if you just want to use the NN or CNN in the future. Maybe try some framework like Tensorflow or Keras(highly recommand for beginers) will help you.
There is also many other factors like methods of gradient descent, the setting of hyper parameters...
Like I said, the topic is very abroad. Why not trying the machine learning/deep learning courses on Coursera or Stanford online course?
I am finding a very hard time to visualize how the activation function actually manages to classify non-linearly separable training data sets.
Why does the activation function (e.g tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to output? What separates training samples of different classes, and how does this work if one had to plot this process graphically?
I've tried looking for numerous sources, but what exactly makes the activation function actually work for classifying training samples in a neural network, I just cannot grasp easily and would like to be able to picture this in my mind.
Mathematical result behind neural networks is Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth almost-piecewise-constant approximators. The more neurons you have – the better your approximation is.
This picture was taked from this article: A visual proof that neural nets can compute any function. Make sure to check that article, it has other examples and interactive applets.
NNs actually, at each level, create new features by distorting input space. Non-linear functions allow you to change "curvature" of target function, so further layers have chance to make it linear-separable. If there were no non-linear functions, any combination of linear function is still linear, thus no benefit from multi-layerness. As a graphical example consider
this animation
This pictures where taken from this article. Also check out that cool visualization applet.
Activation functions have very little to do with classifying non-linearly separable sets of data.
Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
You can think of this as a way of enforcing equality across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known before hand it's not as simple as regular feature scaling (for linear regression) and thusly activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.
Back-propagation is a process that applies error to the neural network. You can think of this as a positive reward for the neurons that contributed to the correct classification and a negative reward for the neurons that contributed to an incorrect classification. This contribution is often known as the gradient of the neural network. The gradient is, effectively, a multi-variable derivative.
When back-propagating the error, each individual neuron's contribution to the gradient is the activations function's derivative at the input value for that neuron. Sigmoid is a particularly interesting function because its derivative is extremely cheap to compute. Specifically s'(x) = 1 - s(x); it was designed this way.
Here is an example image (found by google image searching: neural network classification) that demonstrates how a neural network might be superimposed on top of your data set:
I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.
Im personally studying theories of neural network and got some questions.
In many books and references, for activation function of hidden layer, hyper-tangent functions were used.
Books came up with really simple reason that linear combinations of tanh functions can describe nearly all shape of functions with given error.
But, there came a question.
Is this a real reason why tanh function is used?
If then, is it the only reason why tanh function is used?
if then, is tanh function the only function that can do that?
if not, what is the real reason?..
I stock here keep thinking... please help me out of this mental(?...) trap!
Most of time tanh is quickly converge than sigmoid and logistic function, and performs better accuracy [1]. However, recently rectified linear unit (ReLU) is proposed by Hinton [2] which shows ReLU train six times fast than tanh [3] to reach same training error. And you can refer to [4] to see what benefits ReLU provides.
Accordining to about 2 years machine learning experience. I want to share some stratrgies the most paper used and my experience about computer vision.
Normalizing input is very important
Normalizing well could get better performance and converge quickly. Most of time we will subtract mean value to make input mean to be zero to prevent weights change same directions so that converge slowly [5] .Recently google also points that phenomenon as internal covariate shift out when training deep learning, and they proposed batch normalization [6] so as to normalize each vector having zero mean and unit variance.
More data more accuracy
More training data could generize feature space well and prevent overfitting. In computer vision if training data is not enough, most of used skill to increase training dataset is data argumentation and synthesis training data.
Choosing a good activation function allows training better and efficiently.
ReLU nonlinear acitivation worked better and performed state-of-art results in deep learning and MLP. Moreover, it has some benefits e.g. simple to implementation and cheaper computation in back-propagation to efficiently train more deep neural net. However, ReLU will get zero gradient and do not train when the unit is zero active. Hence some modified ReLUs are proposed e.g. Leaky ReLU, and Noise ReLU, and most popular method is PReLU [7] proposed by Microsoft which generalized the traditional recitifed unit.
Others
choose large initial learning rate if it will not oscillate or diverge so as to find a better global minimum.
shuffling data
In truth both tanh and logistic functions can be used. The idea is that you can map any real number ( [-Inf, Inf] ) to a number between [-1 1] or [0 1] for the tanh and logistic respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now regarding the preference for the tanh over the logistic function is that the first is symmetric regarding the 0 while the second is not. This makes the second one more prone to saturation of the later layers, making training more difficult.
To add up to the the already existing answer, the preference for symmetry around 0 isn't just a matter of esthetics. An excellent text by LeCun et al "Efficient BackProp" shows in great details why it is a good idea that the input, output and hidden layers have mean values of 0 and standard deviation of 1.
Update in attempt to appease commenters: based purely on observation, rather than the theory that is covered above, Tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems to be more prone to local optima, or a least extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR and sigmoid rarely succeeds whereas Tanh and ReLU have more success.
Tanh seems maybe slower than ReLU for many of the given examples, but produces more natural looking fits for the data using only linear inputs, as you describe. For example a circle vs a square/hexagon thing.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters to neural network. Not a direct answer to your question but the tool 'provides intuition' as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (1 - e^2x) / (1 + e^2x)) is preferable to the sigmoid/logistic function (1 / (1 + e^-x)), but it should noted that there is a good reason why these are the two most common alternatives that should be understood, which is that during training of an MLP using the back propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this could generally be calculated for most plausible activation functions (except those with discontinuities, which is a bit of a problem for those), doing so often requires expensive computations and/or storing additional data (e.g. the value of input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient calculations for their derivatives that can be calculated from the output of the functions; i.e. if the node's weighted sum of inputs is v and its output is u, we need to know du/dv which can be calculated from u rather than the more traditional v: for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
In theory I in accord with above responses. In my experience, some problems have a preference for sigmoid rather than tanh, probably due to the nature of these problems (since there are non-linear effects, is difficult understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is choosen randonm between a set of possibilities (sigmoid, tanh, linear, ...). For a 30% of problems of classification, best element found by genetic algorithm has sigmoid as activation function.
In deep learning the ReLU has become the activation function of choice because the math is much simpler from sigmoid activation functions such as tanh or logit, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for hidden layers, meaning you need the derivative of the activation functions. ReLU is a ramp function where you have a flat part where the derivative is 0, and a skewed part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the fading gradient problem, meaning if x is smaller than -2 or bigger than 2, the derivative gets really small and your network might not converge, or you might end up having a dead neuron that does not fire anymore.
The motivating idea behind neural nets seems to be that they learn the "right" features to apply logistic regression to. Is there a similar approach for linear regression? (or just regression problems in general?)
Would doing the obvious thing of removing the application of a sigmoid function for all neurons (ie, including the hidden layers) make sense/work? (ie, each neuron is performing linear regression instead of logistic regression).
Alternatively, would doing the (maybe even more obvious) thing of just scaling output values to [0,1] work? (intuitively I would think not, as the sigmoid function seems like it would cause the net to arbitrarily favor extreme values) (edit: though I was just searching around some more, and saw that one technique is to scale based on mean and variance, which seems like it might deal with this issue -- so maybe this is more viable than I thought).
Or is there some other technique for doing "feature learning" for regression problems?
Check out this applet. Try to learn different functions. When you dictate linear activation functions at both hidden and output layers, it even fails to learn the quadratic function. At least one layer needs to be set to sigmoid function, see figures below.
There are different kinds of scaling. Standard scaling, as you mentioned, eliminates the impact of mean and standard deviation of the training sample, is most often used in machine learning. Just make sure you are using the same mean and std value from training sample in the test sample.
The reason why scaling is required is because the output of sigmoid function ranges at (0,1). I didn't try, but I think it is better to scale the output even if you select linear function at output layer. Otherwise large input at hidden layer (with sigmoid) won't lead to drastic output (the sigmoid function is approximately linear when the input is at a small range, out of such range will make the output changes much slowly). You can try this by yourself in your own data.
Besides, if you have various features, the feature normalization that makes different features in the same scale is also recommended. The scaling speeds up gradient descent by avoiding many extra iterations that are required when one or more features take on much larger values than the rest.
As #Ray mentioned, deep learning that many levels of features are involved can help you with the feature learning, it's not all linear combinations though.