Should I normalize the inputs in my Neural Network?

Should I normalize the inputs in my Neural Network? - machine-learning

first some context.
I'm taking on a very abitious project, making a Neural Network capable of playing Chess at a decent level. I might not succeed but I'm doing it mainly to learn how to approach this kind of machine learning.
I've decided I want to train the network using a genetic algorithm to fine tune the weights after different neural nets have fought against each other in a few games of chess.
Each neuron utilizes a hyperbolic tangent(-1, 1) to normalize the data after it has been processed but no normalization just yet to the input before it enters the network.
I've taken some inspiration from the Giraffe chess engine, particularly the inputs.
They are going to look sort of like this:
First layer:
number of remaining White Pawns (0-8)
number of remaining Black Pawns (0-8)
number of remaining White Knights (0-2)
number of remaining Black Knights (0-2)
....
Second layer still on the same level as the first:
Position of Pawn 1 (probably going with 2 values, x[0-7] and y[0-7])
Position of Pawn 2
...
Position of Queen 1
Position of Queen 2
...
Third layer, again on the same level of the previous two. The data is only going to "crosstalk" after the next layer of abstraction.
Values of pieces attacked by Pawn1 (this is going to be in the 0-12 ish range)
Values of pieces attacked by Pawn2
...
Value of pieces attacked by Bishop1
You get the idea.
If you didn't here's a terrible Paint representation of what I mean:
The question is: should I normalize the input data before it is read by the Neural Network?
I feel like squishing the data might not be such a good idea but I really don't have the competence to make a conclusive call.
I hope someone here can enlighten me on the subject and if you think I should normalize the data, I would like it if you could suggest some ways of doing so.
Thanks!

You shouldn't need to normalize anything inside the network. The point of machine learning is to train weights and bias to learn a non-linear function, in your example it'd be static chess evaluation. Thus, your second Normalized blue vertical bar (near the final output) is unnecessary.
Note: Hidden layers is a better terminology than abstraction layer, so I'll use it instead.
The other normalization you have before the hidden layers is optional but recommended. It also depends on what input we're talking about.
The Giraffe paper writes in page 18:
"Each slot has normalized x coordinate, normalized y coordinate ..."
Chess has 64 squares, without normalization the range would be [0,1,....63]. This is very discrete and the range is much higher than the other inputs (more about later). It does make sense to normalize them to something more manageable and comparable to the other inputs. The paper doesn't say how exactly it gets normalized, but I don't see why [0...1] range wouldn't work. It makes sense to normalize chess squares (or coordinate).
The other inputs such as whether there's a queen on the board is true or false, and thus require no normalization. For example, the Giraffe paper writes in page 18:
... whether piece is present or absent ...
Clearly, you wouldn't normalize it.
Answer to your question
If you represent Piece Count Layer as in Giraffe, you shouldn't need to normalize. But if you prefer a discrete representation in [0..8] (because there could be 9 queens in chess), you might want to normalize.
If you represent Piece Position Layer with chess squares, you should normalize just like Giraffe.
Giraffe doesn't normalize Piece Attack Defense Layer possibly it represents the information as the lowest-valued attacker and defender of each square. Unfortunately, the paper doesn't explicitly state how this is done. Your implementation might require normalization, so use your common sense.
Without any prior assumption which features would be more relevant for the model, you should normalize them to a comparable scaling.
EDITED
Let me answer your comment. Normalization is not the correct term, what you're talking is activation function (https://en.wikipedia.org/wiki/Activation_function). Normalization and activation function are not the same thing.

Related

Classifying pattern in time series

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in sample size but stay within approximate sample size, let's say 500samples +-10%. The heights of the peaks can change. The random signal (I called it random, but basically it means not following pattern shape) can also change in value.
The data is from a sensor. Patterns are when the device is working smoothly. If the device is malfunctioning, then I will not see the patterns and will get something similar to the class 0 I have shown in the image.
What I have done so far is building a logistic regression model. Here are my steps for data preparation:
Grab data between every two consecutive peaks, resample it to a fixed size of 100 samples, scale data to [0-1]. This is class 1.
Repeated step 1 on data between valley and called it class 0.
I generated some noise, and repeated step 1 on chunk of 500 samples to build extra class 0 data.
Bottom figure shows my predictions on the test dataset. Prediction on the noise chunk is not great. I am worried in the real data I may get even more false positives. Any idea on how I can improve my predictions? Any better approach when there is no class 0 data available?
I have seen similar question here. My understanding of Hidden Markov Model is limited but I believe it's used to predict future data. My goal is to classify a sliding window of 500 sample throughout my data.

I have some proposals, that you could try out.
First, I think in this field often recurrent neural networks are used (e.g. LSTMs). But I also heard that some people also work with tree based method like light gbm (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary, because your signals seem to be distinguishable relative easily, you can give light gbm (or other tree ensamble methods) a chance.
If you know the maximum length of a positive sample, you can define the length of your "sliding sample-window" that becomes your input vector (so each sample in the sliding window becomes one input feature), then I would add an extra attribute with the number of samples when the last peak occured (outside/before the sample window). Then you can check in how many steps you let your window slide over the data. This also depends on the memory you have available for this.
But maybe it would be wise then to skip some of the windows between a change between positive and negative, because the states might not be classifiable unambiguously.
In case memory becomes an issue, neural networks could be the better choice, because for training they do not need all training data available at once, so you can generate your input data in batches. With tree based methods this possible does not exist or only in a very limited way.

I'm not sure of what you are trying to achieve.
If you want to characterize what is a peak or not - which is an after the facts classification - then you can use a simple rule to define peaks such as signal(t) - average(signal, t-N to t) > T, with T a certain threshold and N a number of data points to look backwards to.
This would qualify what is a peak (class 1) and what is not (class 0), hence does a classification of patterns.
If your goal is to predict that a peak is going to happen few time units before the peak (on time t), using say data from t-n1 to t-n2 as features, then logistic regression might not necessarily be the best choice.
To find the right model you have to start with visualizing the features you have from t-n1 to t-n2 for every peak(t) and see if there is any pattern you can find. And it can be anything:
was there a peak in in the n3 days before t ?
is there a trend ?
was there an outlier (transform your data into exponential)
in order to compare these patterns, think of normalizing them so that the n2-n1 data points go from 0 to 1 for example.
If you find a pattern visually then you will know what kind of model is likely to work, on which features.
If you don't then it's likely that the white noise you added will be as good. so you might not find a good prediction model.
However, your bottom graph is not so bad; you have only 2 major false positives out of >15 predictions. This hints at better feature engineering.

What “information” in document vectors makes sentiment prediction work?

Sentiment prediction based on document vectors works pretty well, as examples show:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://linanqiu.github.io/2015/10/07/word2vec-sentiment/
I wonder what pattern is in the vectors making that possible. I thought it should be similarity of vectors making that somehow possible. Gensim similarity measures rely on cosine similarity. Therefore, I tried the following:
Randomly initialised a fix “compare” vector, get cosine similarity of the “compare” vector with all other vectors in training and test set, use the similarities and the labels of the train set to estimate a logistic regression model, evaluate the model with the test set.
Looks like this, where train/test_arrays contain document vectors and train/test_labels labels either 0 or 1. (Notice, document vectors are obtained from genism doc2vec and are well trained, predicting the test set 80% right if directly used as input for the logistic regression):
fix_vec = numpy.random.rand(100,1)
def cos_distance_to_fix(x):
return scipy.spatial.distance.cosine(fix_vec, x)
train_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=train_arrays), newshape=(-1,1))
test_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=test_arrays), newshape=(-1,1))
classifier = LogisticRegression()
classifier.fit(train_arrays_cos, train_labels)
classifier.score(test_arrays_cos, test_labels)
It turns out, that this approach does not work, predicting the test set only to 50%....
So, my question is, what “information” is in the vectors, making the prediction based on vectors work, if it is not the similarity of vectors? Or is my approach simply not possible to capture similarity of vectors correct?

This is less a question about Doc2Vec than about machine-learning principles with high-dimensional data.
Your approach is collapsing 100-dimensions to a single dimension – the distance to your random point. Then, you're hoping that single dimension can still be predictive.
And roughly all LogisticRegression can do with that single-valued input is try to pick a threshold-number that, when your distance is on one side of that threshold, predicts a class – and on the other side, predicts not-that-class.
Recasting that single-threshold-distance back to the original 100-dimensional space, it's essentially trying to find a hypersphere, around your random point, that does a good job collecting all of a single class either inside or outside its volume.
What are the odds your randomly-placed center-point, plus one adjustable radius, can do that well, in a complex high-dimensional space? My hunch is: not a lot. And your results, no better than random guessing, seems to suggest the same.
The LogisticRegression with access to the full 100-dimensions finds a discriminating-frontier for assigning the class that's described by 100 coefficients and one intercept-value – and all of those 101 values (free parameters) can be adjusted to improve its classification performance.
In comparison, your alternative LogisticRegression with access to only the one 'distance-from-a-random-point' dimension can pick just one coefficient (for the distance) and an intercept/bias. It's got 1/100th as much information to work with, and only 2 free parameters to adjust.
As an analogy, consider a much simpler space: the surface of the Earth. Pick a 'random' point, like say the South Pole. If I then tell you that you are in an unknown place 8900 miles from the South Pole, can you answer whether you are more likely in the USA or China? Hardly – both of those 'classes' of location have lots of instances 8900 miles from the South Pole.
Only in the extremes will the distance tell you for sure which class (country) you're in – because there are parts of the USA's Alaska and Hawaii further north and south than parts of China. But even there, you can't manage well with just a single threshold: you'd need a rule which says, "less than X or greater than Y, in USA; otherwise unknown".
The 100-dimensional space of Doc2Vec vectors (or other rich data sources) will often only be sensibly divided by far more complicated rules. And, our intuitions about distances and volumes based on 2- or 3-dimensional spaces will often lead us astray, in high dimensions.
Still, the Earth analogy does suggest a way forward: there are some reference points on the globe that will work way better, when you know the distance to them, at deciding if you're in the USA or China. In particular, a point at the center of the US, or at the center of China, would work really well.
Similarly, you may get somewhat better classification accuracy if rather than a random fix_vec, you pick either (a) any point for which a class is already known; or (b) some average of all known points of one class. In either case, your fix_vec is then likely to be "in a neighborhood" of similar examples, rather than some random spot (that has no more essential relationship to your classes than the South Pole has to northern-Hemisphere temperate-zone countries).
(Also: alternatively picking N multiple random points, and then feeding the N distances to your regression, will preserve more of the information/shape of the original Doc2Vec data, and thus give the classifier a better chance of finding a useful separating-threshold. Two would likely do better than your one distance, and 100 might approach or surpass the 100 original dimensions.)
Finally, some comment about the Doc2Vec aspect:
Doc2Vec optimizes vectors that are somewhat-good, within their constrained model, at predicting the words of a text. Positive-sentiment words tend to occur together, as do negative-sentiment words, and so the trained doc-vectors tend to arrange themselves in similar positions when they need to predict similar-meaning-words. So there are likely to be 'neighborhoods' of the doc-vector space that correlate well with predominantly positive-sentiment or negative-sentiment words, and thus positive or negative sentiments.
These won't necessarily be two giant neighborhoods, 'positive' and 'negative', separated by a simple boundary –or even a small number of neighborhoods matching our ideas of 3-D solid volumes. And many subtleties of communication – such as sarcasm, referencing a not-held opinion to critique it, spending more time on negative aspects but ultimately concluding positive, etc – mean incursions of alternate-sentiment words into texts. A fully-language-comprehending human agent could understand these to conclude the 'true' sentiment, while these word-occurrence based methods will still be confused.
But with an adequate model, and the right number of free parameters, a classifier might capture some generalizable insight about the high-dimensional space. In that case, you can achieve reasonably-good predictions, using the Doc2Vec dimensions – as you've seen with the ~80%+ results on the full 100-dimensional vectors.

Machine Learning - SVM

If one trains a model using a SVM from kernel data, the resultant trained model contains support vectors. Now consider the case of training a new model using the old data already present plus a small amount of new data as well.
SO:
Should the new data just be combined with the support vectors from the previously formed model to form the new training set. (If yes, then how to combine the support vectors with new graph data? I am working on libsvm)
Or:
Should the new data and the complete old data be combined together and form the new training set and not just the support vectors?
Which approach is better for retraining, more doable and efficient in terms of accuracy and memory?

You must always retrain considering the entire, newly concatenated, training set.
The support vectors from the "old" model might not be support vectors anymore in case some "new points" are closest to the decision boundary. Behind the SVM there is an optimization problem that must be solved, keep that in mind. With a given training set, you find the optimal solution (i.e. support vectors) for that training set. As soon as the dataset changes, such solution might not be optimal anymore.
The SVM training is nothing more than a maximization problem where the geometrical and functional margins are the objective function. Is like maximizing a given function f(x)...but then you change f(x): by adding/removing points from the training set you have a better/worst understanding of the decision boundary since such decision boundary is known via sampling where the samples are indeed the patterns from your training set.
I understand your concerned about time and memory efficiency, but that's a common problem: indeed training the SVMs for the so-called big data is still an open research topic (there are some hints regarding backpropagation training) because such optimization problem (and the heuristic regarding which Lagrange Multipliers should be pairwise optimized) are not easy to parallelize/distribute on several workers.
LibSVM uses the well-known Sequential Minimal Optimization algorithm for training the SVM: here you can find John Pratt's article regarding the SMO algorithm, if you need further information regarding the optimization problem behind the SVM.

Idea 1 has been already examined & assessed by research community
anyone interested in faster and smarter aproach (1) -- re-use support-vectors and add new data -- kindly review research materials published by Dave MUSICANT and Olvi MANGASARIAN on such their method referred as "Active Support Vector Machine"
MATLAB implementation: available from http://research.cs.wisc.edu/dmi/asvm/
PDF:[1] O. L. Mangasarian, David R. Musicant; Active Support Vector Machine Classification; 1999
[2] David R. Musicant, Alexander Feinberg; Active Set Support Vector Regression; IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 2, MARCH 2004

This is a purely theoretical thought on your question. The idea is not bad. However, it needs to be extended a bit. I'm looking here purely at the goal to sparsen the training data from the first batch.
The main problem -- which is why this is purely theoretical -- is that your data is typically not linear separable. Then the misclassified points are very important. And they will spoil what I write below. Furthermore the idea requires a linear kernel. However, it might be possible to generalise to other kernels
To understand the problem with your approach lets look at the following support vectors (x,y,class): (-1,1,+),(-1,-1,+),(1,0,-). The hyperplane is the a vertical line going trough zero. If you would have in your next batch the point (-1,-1.1,-) the max margin hyperplane would tilt. This could now be exploited for sparsening. You calculate the - so to say - minimal margin hyperplane between the two pairs ({(-1,1,+),(1,0,-)}, {(-1,-1,+),(1,0,-)}) of support vectors (in 2d only 2 pairs. higher dimensions or non-linear kernel might be more). This is basically the line going through these points. Afterwards you classify all data points. Then you add all misclassified points in either of the models, plus the support vectors to the second batch. Thats it. The remaining points can't be relevant.
Besides the C/Nu problem mentioned above. The curse of dimensionality will obviously kill you here
An image to illustrate. Red: support vectors, batch one, Blue, non-support vector batch one. Green new point batch two.
Redline first Hyperplane, Green minimal margin hyperplane which misclassifies blue point, blue new hyperplane (it's a hand fit ;) )

Activation function for neural network

I need help in figuring out a suitable activation function. Im training my neural network to detect a piano note. So in this case I can have only one output. Either the note is there (1) or the note is not present (0).
Say I introduce a threshold value of 0.5 and say that if the output is greater than 0.5 the desired note is present and if its less than 0.5 the note isn't present, what type of activation function can I use. I assume it should be hard limit, but I'm wondering if sigmoid can also be used.

To exploit their full power, neural networks require continuous, differentable activation functions. Thresholding is not a good choice for multilayer neural networks. Sigmoid is quite generic function, which can be applied in most of the cases. When you are doing a binary classification (0/1 values), the most common approach is to define one output neuron, and simply choose a class 1 iff its output is bigger than a threshold (typically 0.5).
EDIT
As you are working with quite simple data (two input dimensions and two output classes) it seems a best option to actually abandon neural networks and start with data visualization. 2d data can be simply plotted on the plane (with different colors for different classes). Once you do it, you can investigate how hard is it to separate one class from another. If data is located in the way, that you can simply put a line separating them - linear support vector machine would be much better choice (as it will guarantee one global optimum). If data seems really complex, and the decision boundary has to be some curve (or even set of curves) I would suggest going for RBF SVM, or at least regularized form of neural network (so its training is at least quite repeatable). If you decide on neural network - situation is quite similar - if data is simply to separate on the plane - you can use simple (linear/threshold) activation functions. If it is not linearly separable - use sigmoid or hyperbolic tangent which will ensure non linearity in the decision boundary.
UPDATE
Many things changed through last two years. In particular (as suggested in the comment, #Ulysee) there is a growing interest in functions differentable "almost everywhere" such as ReLU. These functions have valid derivative in most of its domain, so the probability that we will ever need to derivate in these point is zero. Consequently, we can still use classical methods and for sake of completness put a zero derivative if we need to compute ReLU'(0). There are also fully differentiable approximations of ReLU, such as softplus function

The wikipedia article has some useful "soft" continuous threshold functions - see Figure Gjl-t(x).svg.
en.wikipedia.org/wiki/Sigmoid_function.
Following Occam's Razor, the simpler model using one output node is a good starting point for binary classification, where one class label is mapped to the output node when activated, and the other class label for when the output node is not activated.

Why do we have to normalize the input for an artificial neural network? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical a certain transformation must be performed, but when we have a numerical input? Why the numbers must be in a certain interval?
What will happen if the data is not normalized?

It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.

In neural networks, it is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface. See the following pictures:
Pictures are taken from the coursera course about neural networks. Author of the course is Geoffrey Hinton.

Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such case feeding this raw value into your network will not work very well. You will teach your network on values from lower part of range, while the actual inputs will be from the higher part of this range (and quite possibly above range, that the network has learned to work with).
You should normalize this value. You could for example tell the network by how much the value has changed since the previous input. This increment usually can be defined with high probability in a specific range, which makes it a good input for network.

There are 2 Reasons why we have to Normalize Input Features before Feeding them to Neural Network:
Reason 1: If a Feature in the Dataset is big in scale compared to others then this big scaled feature becomes dominating and as a result of that, Predictions of the Neural Network will not be Accurate.
Example: In case of Employee Data, if we consider Age and Salary, Age will be a Two Digit Number while Salary can be 7 or 8 Digit (1 Million, etc..). In that Case, Salary will Dominate the Prediction of the Neural Network. But if we Normalize those Features, Values of both the Features will lie in the Range from (0 to 1).
Reason 2: Front Propagation of Neural Networks involves the Dot Product of Weights with Input Features. So, if the Values are very high (for Image and Non-Image Data), Calculation of Output takes a lot of Computation Time as well as Memory. Same is the case during Back Propagation. Consequently, Model Converges slowly, if the Inputs are not Normalized.
Example: If we perform Image Classification, Size of Image will be very huge, as the Value of each Pixel ranges from 0 to 255. Normalization in this case is very important.
Mentioned below are the instances where Normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent

When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect some of the parameters. That leads to large oscillations in the search space, as you are bouncing between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, where range from 0 to 1 and 0 to 1 million, respectively. It turns out the ratios for the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.

Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.

I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).

The reason normalization is needed is because if you look at how an adaptive step proceeds in one place in the domain of the function, and you just simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, then you get different results. It boils down to the question of adapting a linear piece to a data point. How much should the piece move without turning and how much should it turn in response to that one training point? It makes no sense to have a changed adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca

On a high level, if you observe as to where normalization/standardization is mostly used, you will notice that, anytime there is a use of magnitude difference in model building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitude don't loose their significance midway the model building process.
example:
√(3-1)^2+(1000-900)^2 ≈ √(1000-900)^2
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses euclidean or, other distance measures.
NNs use optimization algorithm to minimise cost function(ex. - MSE).
Both distance measure(Clustering) and cost function(NNs) use magnitude difference in some way and hence standardization ensures that magnitude difference doesn't command over important input parameters and the algorithm works as expected.

Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable then we need not to use hidden layer e.g. OR gate but if we have a non linearly seperable data then we need to use hidden layer for example ExOR logical gate.
Number of nodes taken at any layer depends upon the degree of cross validation of our output.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart