Correct me if I am wrong here, but it is possible to implement the XOR function with a minimum of 3 gates (NAND, OR)->(AND) using a 1-layer network. But is it possible to train the network correctly when each perceptron uses only a threshold activation function and the perceptron training rule? That is, using the perceptron learning rule and not the delta learning rule.
So far my only solution in theory would be to train each perceptron individually for its specific task (i.e. NAND, OR, and AND) before forming the actual network, but that defeats the point of a network that learns.
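For reference, the "train each gate separately" idea can be sketched as below. This is only an illustration under assumptions: the helper names (train_perceptron, predict, xor), the zero initial weights, and the learning rate of 1.0 are mine, not from any particular course or library. Each gate is linearly separable, so the perceptron rule with a threshold activation converges for each one on its own, and the three trained units are then wired together by hand.

```python
def step(z):
    """Threshold activation: output 1 when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def train_perceptron(samples, lr=1.0, max_epochs=100):
    """Perceptron learning rule: w += lr * (target - output) * x."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in samples:
            out = step(w[0] * x1 + w[1] * x2 + b)
            delta = target - out
            if delta != 0:
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:  # every sample classified correctly: stop early
            break
    return w, b

def predict(model, x1, x2):
    w, b = model
    return step(w[0] * x1 + w[1] * x2 + b)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Train the three gates independently (hypothetical setup for illustration).
nand_gate = train_perceptron([(x, 1 - (x[0] & x[1])) for x in INPUTS])
or_gate = train_perceptron([(x, x[0] | x[1]) for x in INPUTS])
and_gate = train_perceptron([(x, x[0] & x[1]) for x in INPUTS])

def xor(x1, x2):
    """Hand-wired network: AND(NAND(x1, x2), OR(x1, x2))."""
    return predict(and_gate, predict(nand_gate, x1, x2), predict(or_gate, x1, x2))

for x1, x2 in INPUTS:
    print(x1, x2, "->", xor(x1, x2))
```

Note that the wiring (which unit learns which gate) is supplied by hand here, which is exactly the objection raised in the question.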
No, you cannot use the perceptron algorithm to train a multilayer network. You need gradient-based learning, and the perceptron algorithm does not produce gradients; it optimizes for the non-differentiable zero-one loss.
The answer is simple once we remember that the perceptron rule deals with a single layer (one gate: OR, AND, or NAND), whereas an XOR gate is a combination of more than one of these gates (AND, OR, and NAND). That is why the perceptron rule cannot learn the XOR gate.
I know we can't use the perceptron learning algorithm to implement an XOR gate because it is a linearly inseparable problem. So my question is: which learning algorithm and which neural network can we use to implement an XOR gate? I tried using the delta rule, but it is not producing the desired weight matrix.
Thank You!
A 2-layer MLP (multi-layer perceptron) will do the trick.
Consider this article.
By the way, Wikipedia reads:
The delta rule is a gradient descent learning rule for updating the
weights of the inputs to artificial neurons in a single-layer neural
network.
The "single-layer neural network" here is the issue. As you said, a simple (single layer) perceptron does not have the representational power to capture XOR.
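To make the representational limit concrete, here is a small sketch that searches a grid of weights for a single threshold unit. The grid range, the step activation, and the helper name find_unit are assumptions chosen for illustration. Finding nothing on a grid is not a proof by itself, but it matches the general argument: XOR would need b < 0, w1 + b >= 0, w2 + b >= 0 and w1 + w2 + b < 0 all at once, which is contradictory.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def step(z):
    """Threshold activation: output 1 when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def find_unit(targets, grid):
    """Return the first (w1, w2, b) on the grid that reproduces targets, else None."""
    for w1, w2, b in product(grid, repeat=3):
        outputs = [step(w1 * x1 + w2 * x2 + b) for x1, x2 in INPUTS]
        if outputs == targets:
            return (w1, w2, b)
    return None

# Assumed search grid: -4.0 to 4.0 in steps of 0.5.
GRID = [i / 2 for i in range(-8, 9)]

and_unit = find_unit([0, 0, 0, 1], GRID)
or_unit = find_unit([0, 1, 1, 1], GRID)
xor_unit = find_unit([0, 1, 1, 0], GRID)

print("AND:", and_unit)
print("OR: ", or_unit)
print("XOR:", xor_unit)  # no single unit on the grid reproduces XOR
```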
I'm following the Machine Learning course on Coursera and a question just came up.
[Figure: multiple classifiers combined into an XOR classifier]
In this picture we can see that, to make an XOR classifier, we build smaller classifiers, each trained on a linearly separable gate.
So each classifier has a job (for example AND, OR, etc) defined and the network must be trained for this task.
But in a bigger neural net it's impossible to define a task for each neuron (or classifier).
So my question is: is this the task of the backpropagation algorithm (in addition to the fact that it is used to update the weights)?
If someone is wondering the same thing, yes it is the case.
The backprop algorithm effectively carves out a smaller, linearly solvable sub-problem for each neuron (or classifier).
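As a rough illustration of that point, here is a minimal sketch of a small network trained on XOR with plain backpropagation, where nobody tells the hidden units which gate to become. Everything specific here is an assumption of mine: sigmoid units, 3 hidden neurons, squared error, full-batch gradient descent, learning rate 0.5, and the fixed seed. Depending on the random start, such a tiny network can also stall before fitting XOR exactly, so treat it as a sketch rather than a guaranteed recipe.

```python
import math
import random

DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x):
    """Return (hidden activations, network output) for one input pair."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + b)
              for w, b in zip(p["w_hidden"], p["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(p["w_out"], hidden)) + p["b_out"])
    return hidden, output

def mean_squared_error(p):
    return sum((forward(p, x)[1] - y) ** 2 for x, y in DATA) / len(DATA)

def train(n_hidden=3, lr=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    p = {
        "w_hidden": [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_out": rng.uniform(-1, 1),
    }
    initial_loss = mean_squared_error(p)
    for _ in range(epochs):
        # Accumulate gradients of 0.5 * (output - y)^2 over all four samples.
        g_wh = [[0.0, 0.0] for _ in range(n_hidden)]
        g_bh = [0.0] * n_hidden
        g_wo = [0.0] * n_hidden
        g_bo = 0.0
        for x, y in DATA:
            hidden, output = forward(p, x)
            # Error signal at the output unit (through the sigmoid).
            delta_out = (output - y) * output * (1.0 - output)
            g_bo += delta_out
            for j in range(n_hidden):
                # Error signal for hidden unit j: output error sent back via w_out[j].
                delta_h = delta_out * p["w_out"][j] * hidden[j] * (1.0 - hidden[j])
                g_wo[j] += delta_out * hidden[j]
                g_wh[j][0] += delta_h * x[0]
                g_wh[j][1] += delta_h * x[1]
                g_bh[j] += delta_h
        # Gradient descent step on every weight and bias.
        for j in range(n_hidden):
            p["w_out"][j] -= lr * g_wo[j]
            p["w_hidden"][j][0] -= lr * g_wh[j][0]
            p["w_hidden"][j][1] -= lr * g_wh[j][1]
            p["b_hidden"][j] -= lr * g_bh[j]
        p["b_out"] -= lr * g_bo
    return p, initial_loss, mean_squared_error(p)

params, initial_loss, final_loss = train()
print("loss before:", initial_loss, "after:", final_loss)
for x, y in DATA:
    print(x, "target", y, "output", round(forward(params, x)[1], 3))
```

Printing the hidden activations for the four inputs is a simple way to see what sub-problem each hidden unit ended up solving.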
I have a question. I watched a really detailed tutorial on implementing an artificial neural network in C++. And now I have more than a basic understanding of how a neural network works and how to actually program and train one.
So in the tutorial a hyperbolic tangent was used for calculating outputs, and obviously its derivative for calculating gradients. However, I wanted to move on to a different function, specifically Leaky ReLU (to avoid dying neurons).
My question is, it specifies that this activation function should be used for the hidden layers only. For the output layers a different function should be used (either a softmax or a linear regression function). In the tutorial the guy taught the neural network to be an XOR processor. So is this a classification problem or a regression problem?
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
So I implemented the Leaky ReLU function and its derivative, but I don't know whether I should use a softmax or a regression function for the output layer.
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax/regression derivative as well?
Thanks in advance.
I tried to google the difference between the two, but I can't quite grasp the category for the XOR processor. Is it a classification or a regression problem?
In short, classification is for a discrete target and regression is for a continuous target. If it were a floating-point operation, you would have a regression problem. But here the result of XOR is 0 or 1, so it's a binary classification (already suggested by Sid). You should use a softmax layer (or a sigmoid function, which works specifically for 2 classes). Note that the output will be a vector of probabilities, i.e. real-valued, which is then used to choose the discrete target class.
Also, for recalculating the output gradients I use the Leaky ReLU's derivative (for now), but in this case should I use the softmax/regression derivative as well?
Correct. For the output layer you'll need a cross-entropy loss function, which corresponds to the softmax layer, and its derivative for the backward pass.
If there are hidden layers that still use Leaky ReLU, you'll also need Leaky ReLU's derivative for those particular layers.
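Here is a small sketch of the pieces involved, using a sigmoid output (the 2-class case mentioned above) with binary cross-entropy. The negative slope of 0.01 and all function names are assumptions for illustration. The point it shows is that, for sigmoid plus cross-entropy, the gradient with respect to the output pre-activation collapses to (p - y), which the code checks against a finite-difference estimate.

```python
import math

def leaky_relu(z, slope=0.01):
    """Leaky ReLU for hidden layers; slope=0.01 is an assumed default."""
    return z if z > 0 else slope * z

def leaky_relu_grad(z, slope=0.01):
    """Derivative of Leaky ReLU, used when backpropagating through hidden layers."""
    return 1.0 if z > 0 else slope

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def output_delta(z, y):
    """dLoss/dz at the output: for sigmoid + cross-entropy this is p - y."""
    return sigmoid(z) - y

def numeric_grad(f, z, eps=1e-6):
    """Central finite difference, used only to sanity-check the formula."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z, y = 0.7, 1.0  # arbitrary pre-activation and target for the check
analytic = output_delta(z, y)
numeric = numeric_grad(lambda v: binary_cross_entropy(sigmoid(v), y), z)
print("analytic:", analytic, "numeric:", numeric)
```

So the output layer uses its own gradient, (p - y), while each hidden layer multiplies the incoming error by leaky_relu_grad of its own pre-activation.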
Highly recommend this post on backpropagation details.
I am using feed forward, gradient descent backpropagation neural networks.
Currently I have only worked with non-linear networks where tanh is the activation function.
I was wondering:
What kind of tasks would you give to a neural network with a non-linear activation function, and what kind to one with a linear activation function?
I know that networks with a linear activation function are used to solve linear problems.
What are those linear problems?
Any examples?
Thanks!
I'd say never. Since a composition of linear functions is still linear, using a neural network with linear activations is just a way to complicate linear regression.
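To illustrate, here is a tiny sketch with made-up weights (W1, b1, W2, b2 are arbitrary example values, not from any real model). Two stacked layers with linear (identity) activations give the same outputs as a single layer with W = W2 W1 and b = W2 b1 + b2.

```python
def matvec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(u, v):
    return [u_i + v_i for u_i, v_i in zip(u, v)]

# Arbitrary example weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[2.0, 0.0, 1.0]]
b2 = [1.0]

def two_layer(x):
    """Two layers with linear activations."""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapse both layers into one equivalent linear map.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

for x in ([1.0, 2.0], [0.0, 0.0], [-3.0, 0.5]):
    print(x, two_layer(x), one_layer(x))
```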
Whether to choose a linear model or something more complicated is up to you and depends on the data you have; this is one of the reasons why it is customary to hold out some data during training and use it to validate the model. Other ways of testing models are residual analysis, hypothesis testing, and so on.
What is the difference between training an RNN and a simple neural network? Can an RNN be trained using the feed-forward and backward method?
Thanks ahead!
The difference is recurrence. An RNN cannot be trained as easily, because when you try to compute the gradient you soon find that, to get the gradient at the n-th step, you need to "unroll" the network's history over the n-1 previous steps. This technique, known as BPTT (backpropagation through time), is exactly that: a direct application of backpropagation to an RNN. Unfortunately, it is both computationally expensive and mathematically challenging (due to vanishing/exploding gradients). People have created workarounds on many levels, for example by introducing specific types of RNN that can be trained efficiently (LSTM, GRU), or by modifying the training procedure (such as gradient clipping). To sum up: theoretically you can do "typical" backprop in the mathematical sense, but from a programming perspective it requires more work, since you need to "unroll" your network through its history. This is computationally expensive and hard to optimize in the mathematical sense.
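A minimal sketch of what the unrolling looks like, for a toy RNN with a single scalar hidden state, h_t = tanh(w * h_{t-1} + u * x_t), and a squared-error loss on the last step only. The model, the example numbers, and the function names are all assumptions for illustration. The backward loop walks the stored history from the last step to the first, and the result is compared with a finite-difference estimate.

```python
import math

def forward(w, u, xs):
    """Run the toy RNN and keep the whole history [h_0, ..., h_T], with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    """Backpropagation through time: walk the unrolled steps in reverse."""
    hs = forward(w, u, xs)
    grad_w = 0.0
    grad_u = 0.0
    dh = hs[-1] - target                  # dLoss/dh_T
    for t in range(len(xs), 0, -1):
        dz = dh * (1.0 - hs[t] ** 2)      # back through tanh at step t
        grad_w += dz * hs[t - 1]          # w is shared, so contributions add up
        grad_u += dz * xs[t - 1]
        dh = dz * w                       # pass the error on to h_{t-1}
    return grad_w, grad_u

def clip(grad, limit):
    """Value clipping, one simple guard against exploding gradients."""
    return max(-limit, min(limit, grad))

# Arbitrary example values.
w, u, target = 0.8, 0.5, 0.3
xs = [1.0, -0.5, 0.25, 2.0]

grad_w, grad_u = bptt(w, u, xs, target)
eps = 1e-6
numeric_w = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
numeric_u = (loss(w, u + eps, xs, target) - loss(w, u - eps, xs, target)) / (2 * eps)
print("BPTT:   ", grad_w, grad_u)
print("numeric:", numeric_w, numeric_u)
```

The repeated multiplication by w and by (1 - h_t^2) inside the loop is where vanishing or exploding gradients come from as the sequence gets longer.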