Neural network input with exponential decay - machine-learning

Often, to improve learning rates, inputs to a neural network are preprocessed by scaling and shifting them to lie between -1 and 1. I'm wondering, though, whether that's a good idea for an input whose distribution decays roughly exponentially. For instance, suppose an input takes integer values 0 to 100, with the majority of inputs being 0, smaller values more common than large ones, and 99 very rare.
It seems that scaling and shifting wouldn't be ideal, since the most common value would then be -1. How is this type of input best dealt with?

Suppose you're using a sigmoid activation function that is symmetric around the origin:
The trick to speed up convergence is to have the mean of the normalized data set be 0 as well. The choice of activation function is important because you're not only learning weights from the input to the first hidden layer; normalizing the input alone is not enough. The input to the second hidden layer (and to the output) is learned as well, and thus needs to obey the same rule for the normalization to matter. For the non-input layers this is handled by the activation function. The much-cited Efficient BackProp paper by LeCun summarizes these rules and has some nice explanations as well, which you should look up, because there are other things, like weight and bias initialization, that one should consider too.
In chapter 4.3 LeCun gives a formula to normalize the inputs so that the mean is close to 0 and the standard deviation is 1. If you need more sources, this FAQ is great as well.
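As a concrete illustration, here is a minimal sketch of that standardization: shift each input to zero mean and scale to unit standard deviation. The skewed 0-100 counts from the question are simulated with a geometric distribution, which is purely an illustrative assumption.

```python
import numpy as np

# Simulate the question's input: mostly 0, small values common, 99 very rare.
rng = np.random.default_rng(0)
x = np.clip(rng.geometric(p=0.2, size=1000) - 1, 0, 100).astype(float)

# Standardize: zero mean, unit standard deviation (LeCun, chapter 4.3).
standardized = (x - x.mean()) / x.std()
print(standardized.mean(), standardized.std())  # ~0.0 and ~1.0
```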
I don't know your application scenario, but if you're using symbolic data and 0-100 is meant to represent percentages, then you could also apply softmax to the input layer to get better input representations. It's also worth noting that some people prefer scaling to [0.1, 0.9] instead of [0, 1].
Edit: Rewritten to match comments.

Related

RL Activation Functions with Negative Rewards

I have a question regarding appropriate activation functions with environments that have both positive and negative rewards.
In reinforcement learning, our output, I believe, should be the expected reward for all possible actions. Since some options have a negative reward, we would want an output range that includes negative numbers.
This would lead me to believe that the only appropriate activation functions would be either linear or tanh. However, I see many RL papers using ReLU.
So two questions:
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?
Many RL papers indeed use ReLUs for most layers, but typically not for the final output layer. You mentioned the Human Level Control through Deep Reinforcement Learning paper and the Hindsight Experience Replay paper in one of the comments, but neither of those papers describes an architecture that uses ReLUs for the output layer.
In the Human Level Control through Deep RL paper, page 6 (after references), Section "Methods", the last paragraph of the part on "Model architecture" mentions that the output layer is a fully-connected linear layer (not a ReLU). So, indeed, all hidden layers can only have nonnegative activation levels (since they all use ReLUs), but the output layer can have negative activation levels if there are negative weights between the output layer and the last hidden layer. This is necessary because the outputs it creates are interpreted as Q-values (which may be negative).
In the Hindsight Experience Replay paper, they do not use DQN (like the paper above), but DDPG. This is an "Actor-Critic" algorithm. The "critic" part of this architecture is also intended to output values which can be negative, similar to the DQN architecture, so it cannot use a ReLU for the output layer either (but it can still use ReLUs everywhere else in the network). In Appendix A of this paper, under "Network architecture", it is also described that the actor's output layer uses tanh as its activation function.
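To make the pattern concrete, here is a minimal sketch (assuming PyTorch) of the architecture shape both papers share: ReLU activations in the hidden layers, but a plain linear output layer so the network can emit negative Q-values. The layer sizes are illustrative, not taken from either paper.

```python
import torch
import torch.nn as nn

q_network = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),   # hidden layers: ReLU, nonnegative outputs
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),              # output layer: linear, so Q-values can be < 0
)

state = torch.randn(1, 4)          # a dummy 4-dimensional state
print(q_network(state))            # outputs may contain negative entries
```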
To answer your specific questions:
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?
Well, there are also other activations that can produce negative outputs (leaky ReLU, for example, and probably lots of others). But a ReLU indeed cannot produce negative outputs.
Not 100% sure, possibly. It would often be difficult though, if you have no domain knowledge about how big or small rewards (and/or returns) can possibly get. I have a feeling it would typically be easier to simply end with one fully connected linear layer.
If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
No, this is only the case for the activation function of the output layer. For all other layers, it does not matter because you can have negative weights which means neurons with only positive values can still contribute with negative values to the next layer.

Time Series Prediction using Recurrent Neural Networks

I am using a Bike Sharing dataset to predict the number of rentals in a day, given the input. I will use 2011 data to train and 2012 data to validate. I successfully built a linear regression model, but now I am trying to figure out how to predict time series by using Recurrent Neural Networks.
The data set has 10 attributes (such as month, working day or not, temperature, humidity, windspeed), all numerical, though one attribute is the day of the week (Sunday: 0, Monday: 1, etc.).
I assume that one day can and probably will depend on previous days (and I will not need all 10 attributes), so I thought about using RNN. I don't know much, but I read some stuff and also this. I think about a structure like this.
I will have 10 input neurons, a hidden layer and 1 output neuron. I don't know how to decide on how many neurons the hidden layer will have.
I guess that I need a matrix to connect the input layer to the hidden layer, a matrix to connect the hidden layer to the output layer, and a matrix to connect the hidden layers of neighbouring time-steps, t-1 to t and t to t+1. That's a total of three matrices.
In one tutorial the activation function was sigmoid, though I'm not sure exactly; if I use the sigmoid function, I will only get output between 0 and 1. What should I use as the activation function? My plan is to repeat this n times:
For each training example:
  Forward propagate:
    Propagate the input to the hidden layer, add to it the propagation of the previous hidden state to the current hidden layer, and pass the sum through the activation function.
    Propagate the hidden layer to the output.
    Compute the error and its derivative, and store them in a list.
  Back propagate:
    Retrieve the current layers and errors from the list.
    Compute the current hidden layer's error.
    Accumulate the weight updates.
Update the weights (matrices) by applying the accumulated updates scaled by the learning rate.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
It seems to be the correct way to do it, if you just want to learn the basics. If you want to build a neural network for practical use, this is a very poor approach; as Marcin's comment says, almost everyone who constructs neural nets for practical use does so with packages that provide a ready-made implementation of neural networks. Let me answer your questions one by one...
I don't know how to decide on how many neurons the hidden layer will have.
There is no golden rule for choosing the right architecture for your neural network. There are many empirical rules people have established out of experience, and the right number of neurons is decided by trying out various combinations and comparing the output. A good starting point would be 3/2 times the number of input plus output neurons, i.e. (10+1)*(3/2), so you could start with 15 or 16 neurons in the hidden layer and then go on reducing the number based on your output.
What should I use as activation function?
Again, there is no 'right' function. It totally depends on what suits your data. Additionally, there are many types of sigmoid-like functions: hyperbolic tangent, logistic, RBF, etc. A good starting point would be the logistic function, but again you will only find the right function through trial and error.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
Sigmoid-type activation functions (including one assigned to the output neuron) will give you an output between 0 and 1, so you will have to use a multiplier to convert it to real values, or some kind of encoding with multiple output neurons. Coding this manually will be complicated.
Another aspect to consider is your training iterations. Doing it a fixed 'n' times doesn't help. You need to find the optimal number of training iterations by trial and error as well, to avoid both under-fitting and over-fitting.
The correct way to do it would be to use packages in Python or R, which allow you to train neural nets with a large amount of customization quickly, and to train and test multiple nets with different activation functions (and even different training algorithms) and network architectures without too much hassle. With some amount of trial and error, you will eventually find a net that gives you the desired output.
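As a pointer in that direction, here is a minimal sketch (assuming PyTorch, one of the packages alluded to above) of an RNN with a linear output layer, so predictions are unbounded real-valued rental counts rather than numbers between 0 and 1. The 7-day window and the hidden size of 16 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RentalRNN(nn.Module):
    def __init__(self, n_features=10, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)   # linear head -> real-valued output

    def forward(self, x):                 # x: (batch, window, n_features)
        _, h = self.rnn(x)                # h: (1, batch, hidden), last hidden state
        return self.out(h.squeeze(0))     # (batch, 1) predicted rentals

model = RentalRNN()
x = torch.randn(32, 7, 10)               # 32 samples of 7-day feature windows
loss = nn.functional.mse_loss(model(x), torch.randn(32, 1))
loss.backward()                          # one autograd training step
```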

Why use tanh for activation function of MLP?

I'm studying the theory of neural networks and have some questions.
In many books and references, the hyperbolic tangent function is used as the activation function for the hidden layer.
Books give the really simple reason that linear combinations of tanh functions can describe nearly any function shape to within a given error.
But a question came up.
Is this the real reason why the tanh function is used?
If so, is it the only reason why the tanh function is used?
If so, is the tanh function the only function that can do that?
If not, what is the real reason?
I'm stuck here thinking in circles... please help me out of this mental trap!
Most of the time tanh converges faster than the logistic sigmoid and achieves better accuracy [1]. However, more recently the rectified linear unit (ReLU) was proposed by Hinton [2], and it has been shown to train six times faster than tanh [3] to reach the same training error. You can refer to [4] to see what benefits ReLU provides.
Based on about two years of machine learning experience, I want to share some strategies used in most papers, plus my own experience in computer vision.
Normalizing input is very important
Normalizing well leads to better performance and faster convergence. Most of the time we subtract the mean value to make the input mean zero, to prevent the weights from all changing in the same direction and thus converging slowly [5]. Recently Google also pointed out this phenomenon, as internal covariate shift, when training deep networks, and they proposed batch normalization [6] to normalize each vector to zero mean and unit variance.
More data more accuracy
More training data generalizes the feature space better and prevents overfitting. In computer vision, if the training data is not enough, the most commonly used techniques to enlarge the training set are data augmentation and synthesizing training data.
Choosing a good activation function allows training to proceed better and more efficiently.
The ReLU nonlinear activation works better and achieves state-of-the-art results in deep learning and MLPs. Moreover, it has some benefits, e.g. simplicity of implementation and cheaper computation in back-propagation, which make it efficient to train deeper neural nets. However, ReLU gets zero gradient and does not train when the unit is not active. Hence some modified ReLUs have been proposed, e.g. Leaky ReLU and Noisy ReLU; the most popular is PReLU [7], proposed by Microsoft, which generalizes the traditional rectified unit.
Others
Choose a large initial learning rate, as long as training does not oscillate or diverge, so as to find a better minimum.
Shuffle the training data.
In truth both tanh and the logistic function can be used. The idea is that you can map any real number in (-Inf, Inf) to a number in [-1, 1] or [0, 1] for tanh and the logistic function respectively. In this way, it can be shown that a combination of such functions can approximate any non-linear function.
Now regarding the preference for tanh over the logistic function: the first is symmetric around 0 while the second is not. This makes the second one more prone to saturation of the later layers, making training more difficult.
To add to the already existing answer, the preference for symmetry around 0 isn't just a matter of aesthetics. An excellent text by LeCun et al., "Efficient BackProp", shows in great detail why it is a good idea for the input, output and hidden layers to have mean values of 0 and a standard deviation of 1.
Update in an attempt to appease commenters: based purely on observation, rather than the theory covered above, tanh and ReLU activation functions are more performant than sigmoid. Sigmoid also seems more prone to local optima, or at least extended 'flat line' issues. For example, try limiting the number of features to force logic into network nodes in XOR: sigmoid rarely succeeds whereas tanh and ReLU have more success.
Tanh seems maybe slower than ReLU for many of the given examples, but produces more natural-looking fits for the data using only linear inputs, as you describe. For example, a circle vs. a square/hexagon shape.
http://playground.tensorflow.org/ <- this site is a fantastic visualisation of activation functions and other parameters of a neural network. Not a direct answer to your question, but the tool 'provides intuition' as Andrew Ng would say.
Many of the answers here describe why tanh (i.e. (e^(2x) - 1) / (e^(2x) + 1)) is preferable to the sigmoid/logistic function (1 / (1 + e^(-x))), but it should be noted that there is a good reason why these are the two most common alternatives: during training of an MLP using the back-propagation algorithm, the algorithm requires the value of the derivative of the activation function at the point of activation of each node in the network. While this could generally be calculated for most plausible activation functions (except those with discontinuities, which are a bit of a problem), doing so often requires expensive computations and/or storing additional data (e.g. the value of the input to the activation function, which is not otherwise required after the output of each node is calculated). Tanh and the logistic function, however, both have very simple and efficient formulas for their derivatives that can be calculated from the output of the function itself; i.e. if the node's weighted sum of inputs is v and its output is u, we need du/dv, which can be calculated from u rather than the more traditional v: for tanh it is 1 - u^2 and for the logistic function it is u * (1 - u). This fact makes these two functions more efficient to use in a back-propagation network than most alternatives, so a compelling reason would usually be required to deviate from them.
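A small numeric sketch (NumPy) of the point above: both derivatives can be computed from the activation output u alone, which is what makes them so cheap during back-propagation.

```python
import numpy as np

def tanh_deriv_from_output(u):
    return 1.0 - u ** 2            # d/dv tanh(v) = 1 - tanh(v)^2

def logistic_deriv_from_output(u):
    return u * (1.0 - u)           # d/dv sigma(v) = sigma(v) * (1 - sigma(v))

v = np.linspace(-3, 3, 7)

u = np.tanh(v)
assert np.allclose(tanh_deriv_from_output(u), 1 / np.cosh(v) ** 2)

u = 1 / (1 + np.exp(-v))
assert np.allclose(logistic_deriv_from_output(u), np.exp(-v) / (1 + np.exp(-v)) ** 2)
```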
In theory I agree with the above responses. In my experience, some problems have a preference for sigmoid rather than tanh, probably due to the nature of these problems (since the effects are non-linear, it is difficult to understand why).
Given a problem, I generally optimize networks using a genetic algorithm. The activation function of each element of the population is chosen at random from a set of possibilities (sigmoid, tanh, linear, ...). For about 30% of classification problems, the best element found by the genetic algorithm has sigmoid as its activation function.
In deep learning ReLU has become the activation function of choice because the math is much simpler than for sigmoid activation functions such as tanh or the logistic function, especially if you have many layers. To assign weights using backpropagation, you normally calculate the gradient of the loss function and apply the chain rule for hidden layers, meaning you need the derivatives of the activation functions. ReLU is a ramp function: it has a flat part where the derivative is 0 and a linear part where the derivative is 1. This makes the math really easy. If you use the hyperbolic tangent you might run into the vanishing gradient problem: if x is smaller than -2 or bigger than 2, the derivative gets really small and your network might not converge, or you might end up with a dead neuron that does not fire anymore.
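A minimal illustration (NumPy) of that ramp: ReLU's derivative is 0 on the flat part and 1 on the linear part, which keeps the chain-rule computation trivial.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def relu_deriv(v):
    return (v > 0).astype(float)   # 0 where flat, 1 where linear

v = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(v))         # [0.  0.  0.5 2. ]
print(relu_deriv(v))   # [0. 0. 1. 1.]
```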

Activation function for neural network

I need help figuring out a suitable activation function. I'm training my neural network to detect a piano note, so in this case I have only one output: either the note is there (1) or the note is not present (0).
Say I introduce a threshold value of 0.5, and say that if the output is greater than 0.5 the desired note is present, and if it's less than 0.5 the note isn't present. What type of activation function can I use? I assume it should be a hard limit, but I'm wondering if sigmoid can also be used.
To exploit their full power, neural networks require continuous, differentiable activation functions. Thresholding is not a good choice for multilayer neural networks. The sigmoid is quite a generic function that can be applied in most cases. When you are doing binary classification (0/1 values), the most common approach is to define one output neuron and simply choose class 1 iff its output is bigger than a threshold (typically 0.5).
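A minimal sketch of that scheme: a single sigmoid output neuron, with class 1 predicted iff the output exceeds 0.5. The weights here are made-up placeholders; real ones would come from training.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def note_present(features, weights, bias, threshold=0.5):
    """True iff the sigmoid output of the single output neuron exceeds the threshold."""
    return sigmoid(np.dot(weights, features) + bias) > threshold

w, b = np.array([0.8, -0.3, 1.2]), -0.1    # illustrative values only
print(note_present(np.array([0.9, 0.1, 0.7]), w, b))
```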
EDIT
As you are working with quite simple data (two input dimensions and two output classes), it seems the best option is to actually abandon neural networks and start with data visualization. 2D data can simply be plotted on the plane (with different colors for different classes). Once you do that, you can investigate how hard it is to separate one class from the other. If the data is laid out so that you can simply put a line between the classes, a linear support vector machine would be a much better choice (as it will guarantee one global optimum). If the data seems really complex and the decision boundary has to be some curve (or even a set of curves), I would suggest going for an RBF SVM, or at least a regularized form of neural network (so its training is at least fairly repeatable). If you decide on a neural network, the situation is quite similar: if the data is simple to separate on the plane, you can use simple (linear/threshold) activation functions. If it is not linearly separable, use sigmoid or hyperbolic tangent, which will ensure non-linearity in the decision boundary.
UPDATE
Many things have changed over the last two years. In particular (as suggested in the comment, #Ulysee) there is growing interest in functions that are differentiable "almost everywhere", such as ReLU. These functions have a valid derivative over most of their domain, so the probability that we will ever need the derivative at exactly the non-differentiable point is zero. Consequently, we can still use classical methods and, for the sake of completeness, plug in a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as the softplus function.
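A small sketch of that approximation: softplus(x) = ln(1 + e^x) is a fully differentiable stand-in for ReLU, and its derivative is exactly the logistic sigmoid.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))         # ln(1 + e^x), smooth version of max(0, x)

def softplus_deriv(x):
    return 1.0 / (1.0 + np.exp(-x))    # the logistic sigmoid

x = np.linspace(-4, 4, 9)
print(np.round(softplus(x), 3))        # smoothly approaches max(0, x)
print(np.round(softplus_deriv(x), 3))  # smoothly approaches the 0/1 step of ReLU'
```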
The Wikipedia article on sigmoid functions (en.wikipedia.org/wiki/Sigmoid_function) has some useful "soft" continuous threshold functions; see the figure Gjl-t(x).svg.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification: one class label is mapped to the output node being activated, and the other class label to it not being activated.

Why do we have to normalize the input for an artificial neural network? [closed]

Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical, a certain transformation must be performed, but when we have numerical input, why must the numbers be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
In neural networks it is a good idea not just to normalize the data but also to scale it. This is intended to speed up the approach to the global minima on the error surface. See the following pictures, taken from the Coursera course on neural networks by Geoffrey Hinton:
Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such a case, feeding this raw value into your network will not work very well. You will have taught your network on values from the lower part of the range, while the actual inputs will be from the higher part of the range (and quite possibly above the range the network has learned to work with).
You should normalize this value. You could, for example, tell the network by how much the value has changed since the previous input. This increment can usually be bounded, with high probability, within a specific range, which makes it a good input for the network.
There are two reasons why we have to normalize input features before feeding them to a neural network:
Reason 1: If a feature in the dataset is big in scale compared to the others, then this big-scaled feature becomes dominant, and as a result the predictions of the neural network will not be accurate.
Example: In the case of employee data, if we consider age and salary, age will be a two-digit number while salary can be 7 or 8 digits (1 million, etc.). In that case, salary will dominate the prediction of the neural network. But if we normalize those features, the values of both features will lie in the range 0 to 1 (see the sketch after the list below).
Reason 2: Forward propagation in neural networks involves the dot product of weights with input features. So, if the values are very high (for image and non-image data), the calculation of the output takes a lot of computation time as well as memory. The same is the case during back-propagation. Consequently, the model converges slowly if the inputs are not normalized.
Example: If we perform image classification, the image data will be very large, as the value of each pixel ranges from 0 to 255. Normalization in this case is very important.
Mentioned below are the instances where normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
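Here is a minimal sketch (NumPy) of the min-max scaling from the age/salary example above; the numbers are made up for illustration. After scaling, both features lie in [0, 1], so salary no longer dominates by sheer magnitude.

```python
import numpy as np

data = np.array([
    [25, 1_200_000.0],
    [32, 3_500_000.0],
    [47, 8_000_000.0],
])  # columns: age, salary

mins, maxs = data.min(axis=0), data.max(axis=0)
scaled = (data - mins) / (maxs - mins)  # per-feature min-max scaling
print(scaled)                           # every value now lies in [0, 1]
```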
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect to some of the parameters. That leads to large oscillations in the search space, as you bounce between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, which range from 0 to 1 and from 0 to 1 million, respectively. It turns out the ratio of the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
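A toy sketch of that elongated-valley problem: gradient descent on a quadratic loss with badly scaled features needs a much smaller learning rate than the same problem with normalized features, and the small rate then barely moves the well-scaled parameter. The quadratic loss and the specific scales are illustrative assumptions.

```python
import numpy as np

def descend(scales, lr, steps=20):
    """Gradient descent on loss = sum(scales * w**2), starting from w = (1, 1)."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = 2 * scales * w            # gradient of the quadratic loss
        w = w - lr * grad
    return w

print(descend(np.array([1.0, 1.0]), lr=0.4))        # both weights -> 0 quickly
print(descend(np.array([1.0, 1000.0]), lr=0.4))     # w2 oscillates and diverges
print(descend(np.array([1.0, 1000.0]), lr=0.0004))  # stable, but w1 barely moves
```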
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).
The reason normalization is needed is that if you look at how an adaptive step proceeds in one place in the domain of the function, and then transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, you get different results. It boils down to the question of adapting a linear piece to a data point: how much should the piece move without turning, and how much should it turn, in response to that one training point? It makes no sense to have a different adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
At a high level, if you observe where normalization/standardization is mostly used, you will notice that anytime magnitude differences are used in the model-building process, it becomes necessary to standardize the inputs to ensure that important inputs with small magnitude don't lose their significance midway through the model-building process.
example:
√((3-1)² + (1000-900)²) ≈ √((1000-900)²)
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses Euclidean or other distance measures.
NNs use an optimization algorithm to minimize a cost function (e.g. MSE).
Both the distance measure (clustering) and the cost function (NNs) use magnitude differences in some way, so standardization ensures that magnitude differences don't dominate over important input parameters and the algorithm works as expected.
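A quick numeric check of the example above (NumPy): before scaling, the large-magnitude feature dominates the Euclidean distance; after min-max scaling with illustrative feature ranges, both features contribute comparably.

```python
import numpy as np

points = np.array([[3.0, 1000.0],
                   [1.0,  900.0]])

print(np.linalg.norm(points[0] - points[1]))  # ~100.02: the (3-1) term is negligible

# Scale each feature to [0, 1] using assumed (min, max) ranges per feature.
ranges = np.array([[1.0, 3.0], [900.0, 1000.0]])
scaled = (points - ranges[:, 0]) / (ranges[:, 1] - ranges[:, 0])
print(np.linalg.norm(scaled[0] - scaled[1]))  # ~1.41: both features now matter
```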
Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable, then we need not use a hidden layer, e.g. the OR gate; but if we have non-linearly separable data, then we need to use a hidden layer, for example the XOR logical gate.
The number of nodes at any layer is typically chosen by cross-validating on the output.
