Why XOR problem works better with bipolar representation? - machine-learning

I am undertaking a course in Neural Networks and the Professor introduced us to the XOR problem. I understand the XOR problem is not linearly separable and why we need to employ Neural Network for this problem.
However, he mentioned XOR works better with a bipolar representation(-1, +1) which I do not understand. I have looked into resources in different websites yet I can't really understand the reason.
I am wondering why a bipolar representation would be better than a binary representation? What's the rationale for that argument?

The weight deltas of input nodes involve input values. When using the binary representation, an input node may have value 0, meaning that its weight delta is 0. In other words, this node can't 'learn' anything when this input vector is applied.
By contrast, if a bipolar representation is used, this can be avoided because the input nodes never have value 0. It means input nodes can always learn and thus helps the training converge faster.

Related

Machine learning with a variable-sized real vector of inputs?

I have a collection of objects with properties that I measure. For each object, I obtain a vector of real numbers describing that object. The vector is always incomplete: there's usually numbers missing from the beginning or end of what would be the complete vector, and sometimes there is information missing in the middle. Hence, each object results in a vector of a different length. I also measure, say, the mass of each object, and I now want to relate the vector of things I've measured to the mass.
It's common in my field (astrophysics) to extract features from this vector of real numbers, e.g. take an average or some linear combinations of the values; and then use those extracted features to infer the mass (or whatever) using for example neural networks. It was recently shown, however, that a very complex combination of the elements of the vector result in a much better model of the mass.
There are still residuals in this model, however, even when working on simulated data. Presumably then there is a better way out there to manipulate these variable-length vectors in order to get a better model.
I am wondering if it is possible to do machine learning with real-valued input vectors of all different lengths. I know for text mining there are things like the bag-of-words approach, but it is unclear how such a method would work on real-valued vectors. I know recurrent neural networks work on sentences of variable length, but I'm not sure they work for real-valued vectors. I have also considered imputing the missing data; however, sometimes it is missing for physical reasons, i.e. a value in such-and-such place cannot exist, and so imputing it would violate the physicality of the situation.
Is there any research in this area?
Recurrent Neural Networks (RNNs) are capable of taking a variable-sized input vector of length n and producing a variable sized output vector of length m.
There are many ways to make RNNs work. The most common cell types are called Long short-term memory (LSTM) and Gated Recurrent Unit (GRU).
You might want to read:
The Unreasonable Effectiveness of Recurrent Neural Networks: Nice to get an idea what RNNs are capable of, especially character predictors. It is easy to read, but not exactly what you're searching.
Understanding LSTM Networks: More technical; very well written
Sepp Hochreiter, Jurgen Schmidhuber: LONG SHORT-TERM MEMORY
RNNs in TensorFlow
However, training RNNs takes a lot of training data. You might be better off with computing a fixed-size feature vector from it. But you never know when you don't try it ;-)

Neural network input with exponential decay

Often, to improve learning rates, inputs to a neural network are preprocessed by scaling and shifting to be between -1 and 1. I'm wondering though if that's a good idea with an input whose graph would be exponentially decaying. For instance, if I had an input with integer values 0 to 100 distributed with the majority of inputs being 0 and smaller values being more common than large values, with 99 being very rare.
Seems that scaling them and shifting wouldn't be ideal, since now the most common value would be -1. How is this type of input best dealt with?
Consider you're using a sigmoid activation function which is symmetric around the origin:
The trick to speed up convergence is to have the mean of the normalized data set be 0 as well. The choice of activation function is important because you're not only learning weights from the input to the first hidden layer, i.e. normalizing the input is not enough: the input to the second hidden layer/output is learned as well and thus needs to obey the same rule to be consequential. In the case of non-input layers this is done by the activation function. The much cited Efficient Backprop paper by Lecun summarizes these rules and has some nice explanations as well which you should look up. Because there's other things like weight and bias initialization that one should consider as well.
In chapter 4.3 he gives a formula to normalize the inputs in a way to have the mean close to 0 and the std deviation 1. If you need more sources, this is great faq as well.
I don't know your application scenario but if you're using symbolic data and 0-100 is ment to represent percentages, then you could also apply softmax to the input layer to get better input representations. It's also worth noting that some people prefer scaling to [.1,.9] instead of [0,1]
Edit: Rewritten to match comments.

Can neural networks approximate any function given enough hidden neurons?

I understand neural networks with any number of hidden layers can approximate nonlinear functions, however, can it approximate:
f(x) = x^2
I can't think of how it could. It seems like a very obvious limitation of neural networks that can potentially limit what it can do. For example, because of this limitation, neural networks probably can't properly approximate many functions used in statistics like Exponential Moving Average, or even variance.
Speaking of moving average, can recurrent neural networks properly approximate that? I understand how a feedforward neural network or even a single linear neuron can output a moving average using the sliding window technique, but how would recurrent neural networks do it without X amount of hidden layers (X being the moving average size)?
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that? We would first need to know how many timesteps it should have, which we don't. Perhaps a LSTM network could, but even then, what if it's not a simple moving average, it's an exponential moving average? I don't think even LSTM can do it.
Even worse still, what if f(x,x1) that we are trying to learn is simply
f(x,x1) = x * x1
That seems very simple and straightforward. Can a neural network learn it? I don't see how.
Am I missing something huge here or are machine learning algorithms extremely limited? Are there other learning techniques besides neural networks that can actually do any of this?
The key point to understand is compact:
Neural networks (as any other approximation structure like, polynomials, splines, or Radial Basis Functions) can approximate any continuous function only within a compact set.
In other words the theory states that, given:
A continuous function f(x),
A finite range for the input x, [a,b], and
A desired approximation accuracy ε>0,
then there exists a neural network that approximates f(x) with an approximation error less than ε, everywhere within [a,b].
Regarding your example of f(x) = x2, yes you can approximate it with a neural network within any finite range: [-1,1], [0, 1000], etc. To visualise this, imagine that you approximate f(x) within [-1,1] with a Step Function. Can you do it on paper? Note that if you make the steps narrow enough you can achieve any desired accuracy. The way neural networks approximate f(x) is not much different than this.
But again, there is no neural network (or any other approximation structure) with a finite number of parameters that can approximate f(x) = x2 for all x in [-∞, +∞].
The question is very legitimate and unfortunately many of the answers show how little practitioners seem to know about the theory of neural networks. The only rigorous theorem that exists about the ability of neural networks to approximate different kinds of functions is the Universal Approximation Theorem.
The UAT states that any continuous function on a compact domain can be approximated by a neural network with only one hidden layer provided the activation functions used are BOUNDED, continuous and monotonically increasing. Now, a finite sum of bounded functions is bounded by definition.
A polynomial is not bounded so the best we can do is provide a neural network approximation of that polynomial over a compact subset of R^n. Outside of this compact subset, the approximation will fail miserably as the polynomial will grow without bound. In other words, the neural network will work well on the training set but will not generalize!
The question is neither off-topic nor does it represent the OP's opinion.
I am not sure why there is such a visceral reaction, I think it is a legitimate question that is hard to find by googling it, even though I think it is widely appreciated and repeated outloud. I think in this case you are looking for the actually citations showing that a neural net can approximate any function. This recent paper explains it nicely, in my opinion. They also cite the original paper by Barron from 1993 that proved a less general result. The conclusion: a two-layer neural network can represent any bounded degree polynomial, under certain (seemingly non-restrictive) conditions.
Just in case the link does not work, it is called "Learning Polynomials with Neural Networks" by Andoni et al., 2014.
I understand neural networks with any number of hidden layers can approximate nonlinear functions, however, can it approximate:
f(x) = x^2
The only way I can make sense of that question is that you're talking about extrapolation. So e.g. given training samples in the range -1 < x < +1 can a neural network learn the right values for x > 100? Is that what you mean?
If you had prior knowledge, that the functions you're trying to approximate are likely to be low-order polynomials (or any other set of functions), then you could surely build a neural network that can represent these functions, and extrapolate x^2 everywhere.
If you don't have prior knowledge, things are a bit more difficult: There are infinitely many smooth functions that fit x^2 in the range -1..+1 perfectly, and there's no good reason why we would expect x^2 to give better predictions than any other function. In other words: If we had no prior knowledge about the function we're trying to learn, why would we want to learn x -> x^2? In the realm of artificial training sets, x^2 might be a likely function, but in the real world, it probably isn't.
To give an example: Let's say the temperature on Monday (t=0) is 0°, on Tuesday it's 1°, on Wednesday it's 4°. We have no reason to believe temperatures behave like low-order polynomials, so we wouldn't want to infer from that data that the temperature next Monday will probably be around 49°.
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that?
I think that's two questions: First, can a neural network represent that function? I.e. is there a set of weights that would give exactly that behavior? It obviously depends on the network architecture, but I think we can come up with architectures that can represent (or at least closely approximate) this kind of function.
Question two: Can it learn this function, given enough training samples? Well, if your learning algorithm doesn't get stuck in a local minimum, sure: If you have enough training samples, any set of weights that doesn't approximate your function gives a training error greater that 0, while a set of weights that fit the function you're trying to learn has a training error=0. So if you find a global optimum, the network must fit the function.
A network can learn x|->x * x if it has a neuron that calculates x * x. Or more generally, a node that calculates x**p and learns p. These aren't commonly used, but the statement that "no neural network can learn..." is too strong.
A network with ReLUs and a linear output layer can learn x|->2*x, even on an unbounded range of x values. The error will be unbounded, but the proportional error will be bounded. Any function learnt by such a network is piecewise linear, and in particular asymptotically linear.
However, there is a risk with ReLUs: once a ReLU is off for all training examples it ceases learning. With a large domain, it will turn on for some possible test examples, and give an erroneous result. So ReLUs are only a good choice if test cases are likely to be within the convex hull of the training set. This is easier to guarantee if the dimensionality is low. One work around is to prefer LeakyReLU.
One other issue: how many neurons do you need to achieve the approximation you want? Each ReLU or LeakyReLU implements a single change of gradient. So the number needed depends on the maximum absolute value of the second differential of the objective function, divided by the maximum error to be tolerated.
There are theoretical limitations of Neural Networks. No neural network can ever learn the function f(x) = x*x
Nor can it learn an infinite number of other functions, unless you assume the impractical:
1- an infinite number of training examples
2- an infinite number of units
3- an infinite amount of time to converge
NNs are good in learning low-level pattern recognition problems (signals that in the end have some statistical pattern that can be represented by some "continuous" function!), but that's it!
No more!
Here's a hint:
Try to build a NN that takes n+1 data inputs (x0, x1, x2, ... xn) and it will return true (or 1) if (2 * x0) is in the rest of the sequence. And, good luck.
Infinite functions especially those that are recursive cannot be learned. They just are!

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluation turns out positive? Do I perhaps need another kernel type?
Thank you very much!
In general you have to perform cross validation to answer whether the parameters are all right or do they lead to the overfitting.
From the "intuition" perspective - it seems like highly overfitted model. High value of gamma means that your Gaussians are very narrow (condensed around each poinT) which combined with high C value will result in memorizing most of the training set. If you check out the number of support vectors I would not be surprised if it would be the 50% of your whole data. Other possible explanation is that you did not scale your data. Most ML methods, especially SVM, requires data to be properly preprocessed. This means in particular, that you should normalize (standarize) the input data so it is more or less contained in the unit sphere.
RBF seems like a reasonable choice so I would keep using it. A high value of gamma is not necessary a bad thing, it would depends on the scale where your data lives. While a high C value can lead to overfitting, it would also be affected by the scale so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you could use crossvalidation to test your parameters and have some peace of mind.

Why do we have to normalize the input for an artificial neural network? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical a certain transformation must be performed, but when we have a numerical input? Why the numbers must be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.
In neural networks, it is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface. See the following pictures:
Pictures are taken from the coursera course about neural networks. Author of the course is Geoffrey Hinton.
Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such case feeding this raw value into your network will not work very well. You will teach your network on values from lower part of range, while the actual inputs will be from the higher part of this range (and quite possibly above range, that the network has learned to work with).
You should normalize this value. You could for example tell the network by how much the value has changed since the previous input. This increment usually can be defined with high probability in a specific range, which makes it a good input for network.
There are 2 Reasons why we have to Normalize Input Features before Feeding them to Neural Network:
Reason 1: If a Feature in the Dataset is big in scale compared to others then this big scaled feature becomes dominating and as a result of that, Predictions of the Neural Network will not be Accurate.
Example: In case of Employee Data, if we consider Age and Salary, Age will be a Two Digit Number while Salary can be 7 or 8 Digit (1 Million, etc..). In that Case, Salary will Dominate the Prediction of the Neural Network. But if we Normalize those Features, Values of both the Features will lie in the Range from (0 to 1).
Reason 2: Front Propagation of Neural Networks involves the Dot Product of Weights with Input Features. So, if the Values are very high (for Image and Non-Image Data), Calculation of Output takes a lot of Computation Time as well as Memory. Same is the case during Back Propagation. Consequently, Model Converges slowly, if the Inputs are not Normalized.
Example: If we perform Image Classification, Size of Image will be very huge, as the Value of each Pixel ranges from 0 to 255. Normalization in this case is very important.
Mentioned below are the instances where Normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect some of the parameters. That leads to large oscillations in the search space, as you are bouncing between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, where range from 0 to 1 and 0 to 1 million, respectively. It turns out the ratios for the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).
The reason normalization is needed is because if you look at how an adaptive step proceeds in one place in the domain of the function, and you just simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, then you get different results. It boils down to the question of adapting a linear piece to a data point. How much should the piece move without turning and how much should it turn in response to that one training point? It makes no sense to have a changed adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
On a high level, if you observe as to where normalization/standardization is mostly used, you will notice that, anytime there is a use of magnitude difference in model building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitude don't loose their significance midway the model building process.
example:
√(3-1)^2+(1000-900)^2 ≈ √(1000-900)^2
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses euclidean or, other distance measures.
NNs use optimization algorithm to minimise cost function(ex. - MSE).
Both distance measure(Clustering) and cost function(NNs) use magnitude difference in some way and hence standardization ensures that magnitude difference doesn't command over important input parameters and the algorithm works as expected.
Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable then we need not to use hidden layer e.g. OR gate but if we have a non linearly seperable data then we need to use hidden layer for example ExOR logical gate.
Number of nodes taken at any layer depends upon the degree of cross validation of our output.

Resources