Neural networks - why does everybody have a different approach to XOR? - machine-learning

Currently I'm trying to learn how to work with neural networks by reading books, but mostly from internet tutorials.
I often see that "XOR is the 'Hello World' of neural networks".
But here is the thing: the author of one tutorial says that for a neural network that calculates the XOR value we should use 1 hidden layer with 2 neurons. He also uses backpropagation with deltas to adjust the weights.
I implemented this, but even after 1 million epochs the network stays stuck on the input 1 and 1. The answer should be "0", but it is usually 0.5-something. I checked my code and it is correct.
If I add just 1 more neuron to the hidden layer, the network successfully calculates XOR after ~50,000 epochs.
At the same time, some people say that "XOR is not a trivial task and we should use a network with 2-3 or more hidden layers". Why?
Come on, if XOR creates so many problems, maybe we shouldn't use it as the 'Hello World' of neural networks? Please explain what is going on.

So neural networks are really interesting. There's a proof that says a single perceptron can learn any linearly separable function given enough time. Even more impressive, a neural network with one hidden layer can apparently approximate any continuous function, though I've yet to see a proof of that one myself.
XOR is a good function for teaching neural networks because, as CS students, those in the class are likely already familiar with it. In addition, it is not trivial, in the sense that a single perceptron cannot learn it: it isn't linearly separable. See this graphic I put together.
There is no single line that separates these values. Yet it is simple enough for humans to understand and, more importantly, simple enough that a human can understand the neural network that solves it. NNs are very black-box-y; it quickly becomes hard to tell why they work. Hell, here is another network config that can solve XOR.
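For concreteness, here is one hand-wired configuration (my own illustrative sketch, not the graphic from the original answer): a 2-2-1 network with step activations where one hidden unit computes OR, the other computes NAND, and the output unit ANDs them together.

```python
import numpy as np

def step(z):
    # Heaviside step activation: 1 if z > 0, else 0
    return (z > 0).astype(float)

# Hand-picked weights for a 2-2-1 XOR network (illustrative, not learned).
W_hidden = np.array([[ 1.0,  1.0],    # hidden unit 1 ~ OR(x1, x2)
                     [-1.0, -1.0]])   # hidden unit 2 ~ NAND(x1, x2)
b_hidden = np.array([-0.5, 1.5])
W_out = np.array([1.0, 1.0])          # output unit ~ AND(h1, h2)
b_out = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W_hidden @ np.array(x) + b_hidden)
    y = step(W_out @ h + b_out)
    print(x, "->", int(y))            # prints 0, 1, 1, 0
```

Small as it is, every weight in this network can be traced by hand, which is exactly the point being made about XOR as a teaching example.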
Your example of a more complicated network solving it faster shows the power that comes from combining more neurons and more layers. It's absolutely unnecessary to use 2-3 hidden layers to solve it, but it sure helps speed up the process.
The point is that it is a problem simple enough for a human to solve on a blackboard in class, while also being slightly more challenging than a linear function.
EDIT: Another fantastic example for teaching NNs practically is the MNIST hand-drawn digit classification data set. It very easily shows a problem that is simultaneously very simple for humans to understand, very hard to write a non-learning program for, and a very practical use case for machine learning. The problem is that the network structure is impossible to draw on a blackboard and to trace through in a way that is practical for a class. XOR achieves this.
EDIT 2: Also, without the code it will probably be hard to diagnose why it isn't converging. Did you write the neurons yourself? What about the optimization function, etc?
EDIT 3: If the output of your last node is stuck around 0.5, try using a step squashing function that maps all values below 0.5 to 0 and all values at or above 0.5 to 1. You only have binary output anyway, so why bother with a continuous activation on the last node?
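To make EDIT 3 and the original setup concrete, here is a minimal NumPy sketch (my own illustration, assuming sigmoid activations, squared-error deltas and plain full-batch gradient descent, since the OP's code isn't posted) of a 2-3-1 network trained on XOR with the final output thresholded at 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2-3-1 network: 3 hidden units tends to converge much more reliably than 2.
W1 = rng.normal(size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(50_000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backpropagation of squared-error deltas
    delta_out = (out - y) * out * (1 - out)
    delta_h = (delta_out @ W2.T) * h * (1 - h)

    W2 -= lr * h.T @ delta_out; b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h;   b1 -= lr * delta_h.sum(axis=0)

# Threshold the continuous output to get binary answers.
print((out >= 0.5).astype(int).ravel())  # usually [0 1 1 0] once training converges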

Related

How to deal with ill-conditioned neural networks?

When dealing with ill-conditioned neural networks, is the current state of the art to use an adaptive learning rate, some very sophisticated algorithm to deal with the problem, or to eliminate the ill conditioning by preprocessing/scaling of the data?
The problem can be illustrated with the simplest of scenarios: one input and one output, where the function to be learned is y = x/1000, so a single weight whose value needs to be 0.001. One data point is (0, 0). If you are using gradient descent, it turns out to matter a great deal whether the second data point is (1000, 1) or (1, 0.001).
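As a hedged illustration of that sensitivity (plain NumPy, my own sketch rather than the linked TensorFlow example), here is gradient descent on the single weight with the two alternative second data points:

```python
import numpy as np

def fit_single_weight(x2, y2, lr, steps=200):
    # Data: (0, 0) plus one extra point; model y = w * x, squared-error loss.
    xs = np.array([0.0, x2]); ys = np.array([0.0, y2])
    w = 0.0
    for _ in range(steps):
        grad = np.sum(2 * (w * xs - ys) * xs)  # dL/dw
        w -= lr * grad
    return w

# With the point (1000, 1) the gradient scales with x**2 = 1e6, so a learning
# rate that works fine for (1, 0.001) wildly overshoots and diverges.
print(fit_single_weight(1.0,    0.001, lr=0.1))    # converges toward 0.001
print(fit_single_weight(1000.0, 1.0,   lr=1e-7))   # needs a ~1e6-times smaller rate
```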
Theoretical discussion of the problem, with expanded examples.
Example in TensorFlow
Of course, straight gradient descent is not the only available algorithm. Other possibilities are discussed here - however, as that article observes, the alternative algorithms it lists that are good at handling ill-conditioning are not so good when it comes time to handle a large number of weights.
Are newer algorithms available? Yes, but they aren't clearly advertised as solutions for this problem and are perhaps intended to solve a different set of problems; swapping in Adagrad in place of GradientDescent does prevent the overshoot, but it still converges very slowly.
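To make the "swap in Adagrad" point concrete, here is a bare-bones Adagrad update on the same one-weight toy problem (my own sketch, not the TensorFlow code referenced above); the accumulated squared gradients shrink the effective step, so the same nominal learning rate no longer blows up on the badly scaled data:

```python
import numpy as np

def adagrad_single_weight(x2, y2, lr=0.1, steps=10_000, eps=1e-8):
    xs = np.array([0.0, x2]); ys = np.array([0.0, y2])
    w, g_accum = 0.0, 0.0
    for _ in range(steps):
        grad = np.sum(2 * (w * xs - ys) * xs)
        g_accum += grad ** 2                       # running sum of squared gradients
        w -= lr * grad / (np.sqrt(g_accum) + eps)  # per-parameter adaptive step
    return w

# Same nominal learning rate for both scalings; unlike plain gradient descent
# above, neither run diverges, because each step is capped by lr.
print(adagrad_single_weight(1.0,    0.001))
print(adagrad_single_weight(1000.0, 1.0))
```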
At one time, there were efforts to develop heuristics to adaptively tweak the learning rate, but then the learning rate hyperparameter becomes a function rather than just a number, and is much harder to get right.
So these days, is the state of the art to use a more sophisticated algorithm to deal with ill-conditioning, or to just preprocess/scale the data to avoid the problem in the first place?

TensorFlow - GradientDescentOptimizer - are we actually finding global optimum?

I have been playing with TensorFlow for quite a while, and I have more of a theoretical question. In general, when we train a network we usually use GradientDescentOptimizer (or one of its variations like Adagrad or Adam) to minimize the loss function. It looks like we are trying to adjust the weights and biases so that we reach the global minimum of this loss function. But the issue is that I assume this function would look extremely complicated if you plotted it, with lots of local optima. What I wonder is: how can we be sure that gradient descent finds the global optimum, and that we don't instantly get stuck in some local optimum that is far away from the global one?
I recall that, for example, when you perform clustering in sklearn, it usually runs the clustering algorithm several times with random initialization of the cluster centers, and by doing this we ensure that we don't get stuck with a suboptimal result. But we don't do anything like this while training ANNs in TensorFlow - we start with some random weights and just travel along the slope of the function.
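For comparison, the restart strategy the clustering example uses (e.g. KMeans's n_init) can be reproduced by hand for a small network; a rough sketch with scikit-learn's MLPClassifier (my own illustration, not something TensorFlow does automatically):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Train from several random initializations and keep the run with the lowest
# final training loss, mirroring what KMeans does internally with n_init.
best_loss, best_model = float("inf"), None
for seed in range(10):
    model = MLPClassifier(hidden_layer_sizes=(3,), max_iter=5000,
                          random_state=seed).fit(X, y)
    if model.loss_ < best_loss:
        best_loss, best_model = model.loss_, model

print(best_model.predict(X))
```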
So, any insight into this? Why can we be more or less sure that the results of training via gradient descent are close to the global minimum once the loss stops decreasing significantly?
Just to clarify, the reason I am wondering about this is that if we can't be sure we get at least close to the global minimum, we can't easily judge which of two different models is actually better. We could run an experiment and get a model evaluation showing that the model is not good... when in fact it just got stuck in a local minimum shortly after training started, while another model, which seemed better to us, was simply lucky enough to start training from a better starting point and didn't get stuck in a local minimum early. Moreover, this issue means that we can't even be sure we are getting the most out of the network architecture we are currently testing. For example, it could have a really good global minimum that is hard to find, so we mostly end up with poor solutions at local minima far away from the global optimum and never see the full potential of the network at hand.
Gradient descent, by its nature, looks at the function locally (the local gradient). Hence, there is absolutely no guarantee that it will find the global minimum. In fact, it probably will not, unless the function is convex. This is also the reason GD-like methods are sensitive to the initial position you start from. Having said that, a recent paper argued that in high-dimensional solution spaces the number of local maxima/minima is not as large as previously thought.
Finding global minima in high-dimensional space in a reasonable way seems very much an unsolved problem. However, you might want to focus more on saddle points than on local minima. See this post, for example:
High level description for saddle point problem
A more detailed paper is here (https://arxiv.org/pdf/1406.2572.pdf)
Your intuition is quite right. Complex models such as neural networks are typically applied to problems with high dimensional input, where the error surface has a very complex landscape.
Neural networks are not guaranteed to find the global optimum and getting stuck in local minima is a problem where a lot of research has been focussed. If you’re interested in finding out more about this, it would be good to look at techniques such as online learning and momentum, which have traditionally been used to avoid the problem of local minima. However, these techniques in themselves bring further difficulties e.g. integrating online learning is not possible for some optimisation techniques and the addition of a momentum hyper-parameter to the backpropagation algorithm brings further difficulties in training.
A really good video for visualising the influence of momentum (and how it overcomes local minima) during backpropagation can be found here.
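For reference, the classical momentum update itself is only a small change to plain gradient descent; a minimal sketch (my own illustration on a toy 1-D loss, not the code from the video):

```python
def sgd_momentum(grad_fn, w0, lr=0.01, beta=0.9, steps=2000):
    # Classical momentum: a velocity term accumulates past gradients, which
    # helps the parameters roll through flat regions and shallow local minima.
    w, v = float(w0), 0.0
    for _ in range(steps):
        v = beta * v - lr * grad_fn(w)
        w += v
    return w

# Toy 1-D quadratic loss L(w) = (w - 3)**2, just to exercise the update rule
# (an illustrative function of my own, not one from the post).
grad = lambda w: 2 * (w - 3)
print(sgd_momentum(grad, w0=-5.0))  # approaches 3.0
```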
Added after question edit - see comments
It’s the aforementioned nature of the problems neural networks are applied to that means we often can’t find a globally optimal solution, because (in the general case) traversing the entire search space for the optimal solution would be intractable using classical computing technology (quantum computers could change this for some problems). As such neural networks are trained to achieve what is hopefully a ‘good’ local optimum.
If you're interested in reading more detailed information about the techniques employed to find good local optima (i.e. something that approximates a global solution) a good paper to read might be this
No. The gradient descent method only helps to find a local minimum. Only when the global minimum and the local minimum coincide do we get the actual result, i.e. the global minimum.

Training a neural network with complex-valued weights (complex-valued weight initialization, real-valued inputs)

As the title states, I am aiming to train a neural network where the weights are complex numbers. Using the default scikit-learn networks and building on them (editing the source code), the main problem I have encountered is that the optimization functions used in scikit-learn, taken from scipy, only support numerical optimization of functions whose inputs are real numbers.
Scikit-learn seems rather poor for neural networks, especially if you wish to fork and edit the structure; it is rather inflexible.
As I have noticed, and read in a paper here, I need to change things such as the error function to ensure that, at the top level, the error remains in the domain of real numbers, or the problem becomes ill-defined.
My question is: are there any standard libraries that may do this already, or any easy tweaks I could make to Lasagne or TensorFlow to save my life?
P.S.:
Sorry for not posting any working code. It is a difficult question to format to Stack Overflow standards, and I admit it may be off topic, in which case I apologize.
The easiest way to do this is to split each feature into its real and imaginary components. I've done similar work with vector input from a Leap Motion, and it significantly simplifies things if you split vectors into their component axes.
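A rough sketch of that split, assuming the complex values arrive as a NumPy array of features (my own illustration):

```python
import numpy as np

def split_complex_features(X_complex):
    # Represent each complex feature as two real features: its real part and
    # its imaginary part, so a standard real-valued network can consume it.
    return np.concatenate([X_complex.real, X_complex.imag], axis=1)

X = np.array([[1 + 2j, 3 - 1j],
              [0 + 1j, 2 + 2j]])
print(split_complex_features(X))
# [[ 1.  3.  2. -1.]
#  [ 0.  2.  1.  2.]]
```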
Tensorflow has elementary complex number support.
If you have to build the neural network nodes yourself, you can take a glance at this blog.
For holomorphic functions, complex backpropagation is fairly straightforward.
For non-holomorphic functions, it needs careful treatment.

How to model a for loop in a neural network

I am currently in the process of learning neural networks and can understand basic examples like AND, OR, Addition, Multiplication, etc.
Right now, I am trying to build a neural network that takes two inputs x and n, and computes pow(x, n). This would require the neural network to have some form of loop, and I am not sure how I can model a network with a loop.
Can this sort of computation be modelled by a neural network? I assume it is possible, based on the recently released Neural Turing Machine paper, but I'm not sure how. Any pointers on this would be very helpful.
Thanks!
Feedforward neural nets are not Turing-complete, and in particular they cannot model loops of arbitrary length. However, if you fix the maximum n that you want to handle, then you can set up an architecture which can model loops with up to n repetitions. For instance, you could easily imagine that each layer acts as one iteration of the loop, so you might need n layers.
For a more general architecture that can be made Turing-complete, you could use Recurrent Neural Networks (RNNs). One popular instance of this class is the Long Short-Term Memory (LSTM) network by Hochreiter and Schmidhuber. Training such RNNs is quite different from training classical feedforward networks, though.
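To see what a recurrent model would have to capture here, this is the loop body written out explicitly in plain Python (my own illustration, not a trained network): the hidden state carries a running product and is multiplied by x at each of the n steps.

```python
def pow_by_recurrence(x, n):
    # The recurrence an RNN would need to learn: the hidden state starts at 1
    # and is multiplied by x once per time step, for n steps in total.
    state = 1.0
    for _ in range(n):
        state *= x
    return state

print(pow_by_recurrence(2.0, 5))  # 32.0
```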
As you pointed out, Neural Turing Machines seem to work well for learning basic algorithms. For instance, the repeat-copy task implemented in the paper suggests that an NTM can learn the algorithm itself. As of now, NTMs have been used only for simple tasks, so probing their scope with pow(x, n) will be interesting, given that repeat-copy works well. I suggest reading Reinforcement Learning Neural Turing Machines - Revised for a deeper understanding.
Also, recent developments in the area of Memory Networks let us perform more complicated tasks. Hence, making a neural network understand pow(x, n) might be possible. So go ahead and give it a shot!

Neural network to approximate the wanted solution of a system of differential equations

I have a target solution for a system of differential equations that has some unknown parameters. I want to find the values of these parameters for which the solution is closest to the target. Can I do this with neural networks? If yes, how?
I am asking this because a paper I'm reading (unfortunately in Greek) implies to be doing this very thing.
There is a system of differential equations with a given wanted output (both shown in the original post), and the control input u has some unknown non-linearities in it which, it is stated, are approximated using neural networks. Since there is no data to train a network, I wasn't able to understand how this is done. Any ideas?
I don't think so; the typical usage of a NN is to learn a pattern from a set of examples, so that it can properly classify examples it hasn't seen. That description doesn't seem to fit your problem.
Update (after question was edited): I don't think the specifics of the equation are relevant. As you say, there is no data with which to train a network, so it would be hard (if not impossible) to evaluate that aspect, as it might as well have been done by flipping coins. Thus, I think you'd have to focus on the other aspects of the paper (assuming there are some).
