I'm training a neural network on data that comes in as negative & positive values.
Is there any way to feed the data into a ReLU network without converting it all to positive and having a separate input which says if the data is negative or positive?
The problem I see is that a negative input at the input layer means that unless you have initialised your weights to be negative, the ReLU node isn't ever activated and is forever dead.
I'm not entirely sure what you're asking, as there are many activation functions and you can easily code your own. If you don't want to code your own, maybe try some alternatives:
Leaky ReLU
Parametric ReLU
Basically, take a look here
If you really use an activation function with the input layer, I would suggest either using another activation function like ELU or transforming your data to the range [0,1], for example.
If the ReLU function is in some hidden layer, the ReLU function should become dead only temporarily.
Suppose you have a ReLU function in the last hidden layer of a feed-forward network. With the backpropagation algorithm it should be possible that the outputs of the previous hidden layers are changed in such a way that, eventually, the input to the ReLU function will become positive again. Then the ReLU would not be dead anymore. Chances are that I am missing something here.
Anyway, you should definitely give ELU a try! I have experienced better results with it than with the ReLU.
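To make the alternatives mentioned in this thread concrete, here is a minimal NumPy sketch of ReLU, Leaky ReLU and ELU applied to signed inputs. The function names, the slope `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are assumptions (commonly used defaults), not values given in the answers above. The last lines illustrate that a negative input does not by itself kill a ReLU unit, since a negative weight or a positive bias can still make the pre-activation positive.

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha (assumed default) on the negative side,
    # so the gradient there is small but not zero.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smoothly saturates towards -alpha for large negative inputs.
    # np.minimum keeps exp() from overflowing on large positive x.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(leaky_relu(x))
print(elu(x))

# Illustrative weight and bias (hypothetical values): the input is negative,
# but the pre-activation w * x + b is positive, so the ReLU unit is active.
w, b = -1.5, 0.1
active = relu(w * -2.0 + b)
print(active)
```

If the data is instead rescaled to [0, 1] as suggested above, the same functions apply unchanged; only the sign distribution of the inputs differs.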
Related
Why is the log-sigmoid activation function the primary choice in the hidden layer instead of the tanh-sigmoid activation function? And also, if I use Z-score normalization, could I use the sigmoid activation function in the hidden layer?
Ancient history
The use of the sigmoid function was, historically, physically motivated. The first neural networks, in the very early days, in fact used the step function.
The motivation was that this is how neurons work in the brain, at least to the understanding of that time. At a certain fixed activation energy the neuron "activates", going from inactive (0) to active (1). However, these networks are very hard to train, and the standard paradigm was also physically motivated, e.g. "neurons that are used often, get a stronger connection". This worked for very small networks, but did not scale at all to larger networks.
Gradient descent and the advent of the sigmoid
In the 80's a slight revolution took place in neural networks when it was discovered that they can be trained using gradient descent. This allowed the networks to grow to much larger sizes, but it also spelled the end of the step activation, since it is not differentiable. However, given the long history of the step activation and its plausible physical motivation, people were hesitant to abandon it fully, and hence approximated it by the sigmoid function, which shares many of its characteristics, but is differentiable around 0.
Later on, people started using the tanh function since it is zero centered, which gives somewhat better characteristics in some cases.
The RevoLUtion
Then in 2000, a seminal paper was published in Nature that suggested the use of the ReLU activation function.
This was motivated by problems with the earlier activation functions; its most important advantages are speed and the fact that it suffers far less from the vanishing gradient problem. Since then, basically all top neural network research has been using the ReLU activation or slight variations thereof.
The only exception is perhaps recurrent networks, where the output is fed back as input. In these, using unbounded activation functions such as the ReLU would quickly lead to an explosion in results, and people still use the sigmoid and/or tanh in these cases.
My question is based on the understanding from https://www.youtube.com/watch?v=oYbVFhK_olY&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=43
In a neural network, a neuron is activated by a threshold (activation) function, which in the above example is a sigmoid function. For regression problems, do we need an activation function?
regards
Souvik
Yes. A neural network used for regression requires activation functions too. If you are regressing to a continuous output range, sigmoid-like functions can be used as activation functions. Avoid threshold-like activation functions. Apart from the output node, activation functions are also required on all the intermediate nodes of the input and hidden layers.
For the hidden layers, you should use ReLU, and you should not use an activation function for the output layer.
Please check this document: Linear Regression
Let's get some terminology clear:
By "regression" we usually refer to the simplest model, where each of the independent variables has a parameter and we multiply the variable by the parameter. In linear regression we stop there, with a continuous output. In logistic regression we also use an "activation function", usually the sigmoid, to get the output into the range (0,1), and then we can use a threshold (like x>0.5) to decide whether the output is 0 or 1.
So regression is like a neural network with 0 hidden layers.
Now I assume that what you meant is that you want the neural network to have a continuous output, like linear regression. So naturally we don't want to use any "distorting" function on the output layer.
However, we do want to use an activation function on the neurons in the hidden layers. The reason behind this is that we want the network to be able to break the linearity, so that it can do something more interesting than just multiplying the variables by different parameters and can combine them in different ways. The idea is to allow the network to simulate any function, even non-linear ones.
So which activation should we choose to break the linearity? There are numerous options: sigmoid, tanh, relu, leaky-relu, elu, and many more. The most common one today is relu, but this can change. It's mostly an empirical question of which ones work best and allow for the fastest learning, and it is up to the neural network architect to decide which function to use. As you get to know more of these functions and their pros and cons, you'll be able to try a few for every problem and see what works best for you.
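The point about "breaking the linearity" can be checked numerically. The following is a minimal sketch with hypothetical layer sizes and random weights: a network for regression with a nonlinear hidden layer and no activation on the output, compared against the same network with an identity "activation", which collapses into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 1 continuous output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, activation):
    h = activation(x @ W1 + b1)   # hidden layer: activation applied
    return h @ W2 + b2            # output layer: no activation, unbounded output

x = rng.normal(size=(5, 3))

# With an identity activation, the two layers are equivalent to one linear layer.
collapsed = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(forward(x, identity), collapsed))  # True

# With ReLU in the hidden layer the output is, in general, no longer this linear map.
print(forward(x, relu).ravel())
```

Leaving the output layer linear keeps the prediction unbounded, which is what a continuous regression target usually needs.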
I am trying to write a neural network class, but I don't fully understand some aspects of it. I have two questions on the following design.
Am I doing this correctly? Does the bias neuron need to connect to all of the neurons (except those in the input layer) or just those in the hidden layer?
My second question is about calculating the output value. I'm using the equation below to calculate the output value of the neurons.
HiddenLayerFirstNeuron.Value =
(input1.Value * weight) + (input2.Value * weight) + (Bias.Value * weight)
After this equation, I calculate the activation and send the result to the output. The output neurons do the same.
I'm not sure whether what I am doing is right, and I want to clear up any problems.
Take a look at the Theano tutorials at http://deeplearning.net/tutorial/contents.html. They explain everything you need to know about the multilayer perceptron using Theano (a symbolic mathematics library).
The bias is usually connected to all hidden and output units.
Yes, you compute the input of the activation function as the sum, over the neurons of the previous layer, of weight * output of that neuron.
Good luck with development ;)
There should be a separate bias neuron for each hidden layer and for the output layer. Think of a layer as a function applied to a first-order polynomial, such as f(m*x+b)=y, where y is your output and f your activation function. If you look at the linear term you will recognize the b. This represents the bias, and it behaves in a neural network much as it does in this simplification: it shifts the hyperplane up and down in the space. Keep in mind that you will have one bias per layer, connected to all neurons of that layer, so that each neuron computes f(w1*x1+...+wn*xn+b), with an initial value of 1. When it comes to gradient descent, you will have to train the bias like a normal weight.
In my opinion you should apply the activation function to the output layer as well. This is how it's usually done with multilayer perceptrons. But it actually depends on what you want. If, for example, you use the logistic function as the activation function and you want an output in the interval (0,1), then you have to apply your activation function to the output as well, since a basic linear combination, as in your example, can theoretically go beyond the boundaries of that interval.
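The weighted sum with a bias described in this thread can be sketched as a small forward pass. This is a minimal illustration, not the asker's class: the function name `layer_forward`, the logistic activation and all numeric values are assumptions, and every connection is given its own weight (the `weight` in the question's equation is read as a different value per term).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(inputs, weights, biases, activation=sigmoid):
    # weights[j, i] connects input i to neuron j.
    # biases[j] is neuron j's bias weight, i.e. the weight on a constant bias input of 1.
    return activation(weights @ inputs + biases)

inputs = np.array([0.5, -1.0])

# Illustrative (hypothetical) weights for 2 hidden neurons and 1 output neuron.
hidden_w = np.array([[0.1, 0.8],
                     [-0.4, 0.2]])
hidden_b = np.array([0.3, -0.1])
output_w = np.array([[1.0, -1.0]])
output_b = np.array([0.05])

hidden = layer_forward(inputs, hidden_w, hidden_b)   # bias reaches every hidden neuron
output = layer_forward(hidden, output_w, output_b)   # and the output neuron as well
print(hidden, output)
```

Storing the bias as one trainable value per neuron is equivalent to a single bias neuron with a constant output of 1 and one trainable weight per outgoing connection.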
I am learning neural networks for the first time. I was trying to understand how function approximation can be performed using a single hidden layer. I saw this example on stackexchange, but I had some questions after going through one of the answers.
Suppose I want to approximate a sine function between 0 and 3.14 radians. Will I have 1 input neuron? If so, suppose I then have K neurons in the hidden layer, each of which uses a sigmoid transfer function. If the output neuron just uses a linear sum of the results from the hidden layer, how can the output be something other than sigmoid-shaped? Shouldn't the linear sum be a sigmoid as well? In short, how can a sine function be approximated using this architecture in a neural network?
It is possible, and it is formally stated as the universal approximation theorem. It holds for any non-constant, bounded, and monotonically-increasing continuous activation function.
I actually don't know the formal proof but to get an intuitive idea that it is possible I recommend the following chapter: A visual proof that neural nets can compute any function
It shows that with enough hidden neurons and the right parameters you can create step functions as the summed output of the hidden layer. With step functions it is easy to argue how you can approximate any function, at least coarsely. Now, to get the final output correct when the output neuron itself applies an activation σ, the weighted sum of the hidden layer has to be σ^(-1)(f(x)), where f is the target function, since the final neuron then outputs σ(σ^(-1)(f(x))) = f(x). And as already said, we are able to approximate this at least to some accuracy.
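The step-function argument can be tried out directly on the sine example from the question. The sketch below builds the weights by hand rather than training them, and assumes a linear output neuron as in the question; the name `staircase_net`, the number of hidden neurons `k` and the `steepness` value are illustrative assumptions. Each hidden neuron is a steep sigmoid acting as a step at a bin edge, and the output weights are the jumps between neighbouring target levels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def staircase_net(x, target=np.sin, lo=0.0, hi=np.pi, k=50, steepness=100.0):
    edges = np.linspace(lo, hi, k + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    levels = target(centers)                 # desired height on each of the k bins
    # Hidden layer: k - 1 steep sigmoids, one per interior bin edge.
    hidden = sigmoid(steepness * (x[:, None] - edges[None, 1:-1]))
    # Linear output neuron: bias = first level, weights = jumps between levels.
    return levels[0] + hidden @ np.diff(levels)

x = np.linspace(0.0, np.pi, 1000)
max_err = np.max(np.abs(staircase_net(x) - np.sin(x)))
print(max_err)
```

A sum of shifted and scaled sigmoids is therefore not itself sigmoid-shaped; increasing `k` makes the staircase finer and the approximation closer.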
I'm personally studying the theory of neural networks and have some questions.
In many books and references, the hyperbolic tangent (tanh) function is used as the activation function of the hidden layer.
The books give a really simple reason: linear combinations of tanh functions can describe nearly any shape of function to within a given error.
But this raised a question.
Is this the real reason why the tanh function is used?
If so, is it the only reason why the tanh function is used?
If so, is tanh the only function that can do that?
If not, what is the real reason?
I'm stuck here and keep thinking about it... please help me out of this mental(?...) trap!
Most of the time tanh converges more quickly than the sigmoid/logistic function and achieves better accuracy [1]. However, the rectified linear unit (ReLU) was more recently proposed by Hinton [2], and it has been shown that ReLU trains six times faster than tanh [3] to reach the same training error. You can refer to [4] to see what benefits ReLU provides.
Based on about 2 years of machine learning experience, I want to share some strategies that most papers use, along with my own experience in computer vision.
Normalizing input is very important
Normalizing well can give better performance and faster convergence. Most of the time we subtract the mean value so that the input has zero mean, which prevents the weights from all changing in the same direction and thus converging slowly [5]. Recently Google also pointed out this phenomenon in the training of deep networks, calling it internal covariate shift, and proposed batch normalization [6], which normalizes each vector to have zero mean and unit variance.
More data more accuracy
More training data covers the feature space better and helps prevent overfitting. In computer vision, if there is not enough training data, the most commonly used techniques for enlarging the training set are data augmentation and synthesizing training data.
Choosing a good activation function allows training better and efficiently.
The ReLU nonlinear activation works better and has achieved state-of-the-art results in deep learning and MLPs. Moreover, it has some benefits, e.g. it is simple to implement and cheaper to compute in back-propagation, which makes it efficient to train deeper neural nets. However, a ReLU gets zero gradient and does not train when the unit is not active (its output is zero). Hence some modified ReLUs have been proposed, e.g. Leaky ReLU and Noisy ReLU, and the most popular method is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit.
Others
Choose a large initial learning rate, as long as it does not cause oscillation or divergence, so as to find a better minimum.
Shuffle the data.
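The input normalization and shuffling recommended above can be sketched as follows. This is plain feature standardization of the inputs, not batch normalization; the helper names `fit_standardizer` and `apply_standardizer`, the small `eps` guard and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_standardizer(train, eps=1e-8):
    # Per-feature statistics are computed on the training set only.
    mean = train.mean(axis=0)
    std = train.std(axis=0) + eps   # eps (assumed) avoids division by zero
    return mean, std

def apply_standardizer(x, mean, std):
    # The same training statistics are reused for validation and test data.
    return (x - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))   # synthetic example data

mean, std = fit_standardizer(train)
train_n = apply_standardizer(train, mean, std)
print(train_n.mean(axis=0), train_n.std(axis=0))

# Shuffle the sample order, e.g. once per epoch before forming mini-batches.
order = rng.permutation(len(train_n))
shuffled = train_n[order]
```

After this step each input feature has approximately zero mean and unit variance on the training set.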
In truth both the tanh and logistic functions can be used. The idea is that you can map any real number in (-Inf, Inf) to a number in (-1, 1) or (0, 1) for the tanh and logistic respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now, the preference for the tanh over the logistic function comes from the fact that the first is symmetric around 0 while the second is not. This makes the second one more prone to saturation of the later layers, making training more difficult.
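The symmetry argument can be illustrated numerically. In this minimal sketch (the grid of pre-activations is an assumption chosen to be symmetric around 0), tanh outputs average to about 0 while logistic outputs average to about 0.5, so the next layer receives inputs that are not zero-centered. The last line checks the standard identity tanh(z) = 2 * logistic(2z) - 1, which shows the two functions differ only by shifting and scaling.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Symmetric, zero-mean pre-activations (illustrative choice).
z = np.linspace(-3.0, 3.0, 601)

tanh_mean = np.tanh(z).mean()
logistic_mean = logistic(z).mean()
print(tanh_mean)       # close to 0
print(logistic_mean)   # close to 0.5

# tanh is a shifted and rescaled logistic function.
print(np.allclose(np.tanh(z), 2.0 * logistic(2.0 * z) - 1.0))  # True
```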
To add to the already existing answer, the preference for symmetry around 0 isn't just a matter of aesthetics. An excellent text by LeCun et al., "Efficient BackProp", shows in great detail why it is a good idea for the input, output and hidden layers to have mean values of 0 and a standard deviation of 1.
Update in an attempt to appease commenters: based purely on observation, rather than the theory that is covered above, Tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems to be more prone to local optima, or at least to extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR: sigmoid rarely succeeds, whereas Tanh and ReLU have more success.
Tanh seems maybe slower than ReLU for many of the given examples, but produces more natural looking fits for the data using only linear inputs, as you describe. For example a circle vs a square/hexagon thing.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters to neural network. Not a direct answer to your question but the tool 'provides intuition' as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (e^2x - 1) / (e^2x + 1)) is preferable to the sigmoid/logistic function (1 / (1 + e^-x)), but it should be noted that there is a good reason why these are the two most common alternatives, and it should be understood: during training of an MLP using the back propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this can generally be calculated for most plausible activation functions (except those with discontinuities, which is a bit of a problem for those), doing so often requires expensive computations and/or storing additional data (e.g. the value of the input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient calculations for their derivatives that can be calculated from the output of the functions. That is, if the node's weighted sum of inputs is v and its output is u, we need to know du/dv, which can be calculated from u rather than the more traditional v: for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
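The two derivative formulas above can be verified with a short numerical check. This sketch uses the answer's notation (v for the weighted sum, u for the output); the function names and the finite-difference step `h` are assumptions. It compares du/dv computed from the output u alone against a central finite difference in v.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def tanh_grad_from_output(u):
    # du/dv for u = tanh(v), expressed only in terms of the output u.
    return 1.0 - u ** 2

def logistic_grad_from_output(u):
    # du/dv for u = logistic(v), expressed only in terms of the output u.
    return u * (1.0 - u)

v = np.linspace(-4.0, 4.0, 9)
h = 1e-6  # step for the central finite difference

numeric_tanh = (np.tanh(v + h) - np.tanh(v - h)) / (2 * h)
numeric_logistic = (logistic(v + h) - logistic(v - h)) / (2 * h)

print(np.allclose(tanh_grad_from_output(np.tanh(v)), numeric_tanh, atol=1e-6))              # True
print(np.allclose(logistic_grad_from_output(logistic(v)), numeric_logistic, atol=1e-6))     # True
```

In a backward pass this means only the stored outputs u of each layer are needed, not the pre-activations v.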
In theory I agree with the above responses. In my experience, some problems have a preference for sigmoid rather than tanh, probably due to the nature of these problems (since there are non-linear effects, it is difficult to understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is chosen at random from a set of possibilities (sigmoid, tanh, linear, ...). For about 30% of classification problems, the best element found by the genetic algorithm has sigmoid as its activation function.
In deep learning the ReLU has become the activation function of choice because the math is much simpler than for sigmoid-shaped activation functions such as tanh or the logistic function, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for hidden layers, meaning you need the derivative of the activation functions. ReLU is a ramp function with a flat part where the derivative is 0 and a sloped part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the vanishing gradient problem, meaning that if x is smaller than -2 or bigger than 2, the derivative gets really small and your network might not converge, or you might end up having a dead neuron that does not fire anymore.
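The difference in derivatives can be made concrete with a small sketch. The function names, the sample inputs and the depth of 10 are illustrative assumptions; the last part multiplies the activation derivative across stacked layers, as the chain rule does, while ignoring the weight terms for simplicity.

```python
import numpy as np

def relu_grad(x):
    # Derivative of ReLU: 0 on the flat part, 1 on the sloped part.
    return np.where(np.asarray(x) > 0, 1.0, 0.0)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, small once |x| exceeds about 2.
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, -2.0, 0.5, 2.0, 3.0])
print(relu_grad(x))
print(tanh_grad(x))

# Product of activation derivatives through 10 stacked layers,
# all evaluated at x = 2 (an illustrative simplification).
depth = 10
tanh_product = tanh_grad(2.0) ** depth
relu_product = relu_grad(2.0) ** depth
print(tanh_product, relu_product)
```

For tanh the product shrinks rapidly with depth, while for an active ReLU it stays at 1; a ReLU on its flat part contributes 0 instead, which is the dead-unit case discussed earlier.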