How many neurons does a perceptron have? - machine-learning

This is a classical visualization of the perceptron learning model, though I don't know where it comes from originally.
My question is: how many neurons does this perceptron have? My guess is N+2: N+1 for the inputs and 1 for the output. Is that correct?

The network above takes numerical inputs x1, x2, ..., xn and has weights w1, w2, ..., wn associated with those inputs. There is also a constant input 1 with weight w0 (called the bias unit). All of this together is one neuron.
This is what a bias unit does:
The bias provides every node with a trainable constant value (in addition to the normal inputs that the node receives).
The output is the weighted sum. Something like this:
f(x) = x1*w1 + x2*w2 + ... + xn*wn + 1*w0
To learn more, check this document, which explains it very well: http://117.239.79.250/moodle/pluginfile.php/6283/mod_resource/content/1/ANN1.pdf

A perceptron is itself a type of neuron. In the figure, the four inputs aren't neurons; they are just 4 inputs to a single neuron (the perceptron). The step function circle isn't an extra neuron either. The step function is applied inside the perceptron, to the weighted sum that the perceptron calculates.
So what you see in the figure is a single neuron with its components broken down into fundamental parts.
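To make the "one neuron" point concrete, here is a minimal sketch of a perceptron in Python. The function name and the AND weights are made up for illustration; the structure (inputs, bias input of 1 with weight w0, weighted sum, step function) follows the description above.

```python
def perceptron(inputs, weights, bias_weight):
    """Single neuron: weighted sum of the inputs plus bias, then a step function."""
    # The bias unit is a constant input of 1 with its own trainable weight w0.
    weighted_sum = 1 * bias_weight + sum(x * w for x, w in zip(inputs, weights))
    # The step function is part of the same neuron, not a separate one.
    return 1 if weighted_sum >= 0 else 0


# Hypothetical weights that make the neuron compute a logical AND of two inputs.
and_weights = [1.0, 1.0]
and_bias_weight = -1.5

outputs = [perceptron([a, b], and_weights, and_bias_weight)
           for a in (0, 1) for b in (0, 1)]
print(outputs)
```

The inputs are plain numbers passed in, and everything that does any computation lives inside the one function, which is why the figure counts as a single neuron.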

Related

Multiclass classification for n classes with number of output neurons = ceiling of log2 (n)

Suppose I want to use a multilayer perceptron to classify 3 classes. When it comes to the number of output neurons, anybody would instantly say: use 3 output neurons with softmax activation. But what if I use 2 output neurons with sigmoid activations to output [0,0] for class 1, [0,1] for class 2 and [1,0] for class 3? Basically, I would get a binary-encoded output, with each bit produced by one output neuron. Wouldn't this technique decrease the number of output neurons (and hence the number of parameters) by a lot? A 100-class word classification for a simple NLP application would require 100 output neurons with softmax, whereas you can cover it with 7 output neurons using the above technique. One disadvantage is that you won't get probability scores for all the classes. My question is: is this approach correct? If so, would you consider it more efficient than softmax for datasets with a large number of classes?
You could do this, but then you would have to rethink your loss function. The cross-entropy loss used in training a model for classification is the negative log-likelihood of a categorical distribution, which assumes you have a probability associated with every class. The loss function requires 3 output probabilities and you only have 2 output values.
However, there are ways to do it anyway: you could use a binary cross-entropy loss on each element of your output, but this would be a different probabilistic assumption about your model. You'd be assuming that your classes have some shared characteristics: for example, [0,0] and [0,1] share the value of the first bit. The decreased degrees of freedom are probably going to give you marginally worse performance (but other parts of the MLP may pick up the slack).
If you're really worried about the parameter cost of the final layer, then you might be better off not training it at all. This paper shows a fixed Hadamard matrix on the final layer is as good as training it.
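The binary-coded output scheme discussed above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names are made up, class labels are zero-based (so class 1 in the question is label 0), and the loss is the per-bit binary cross-entropy mentioned in the answer, not a trained model.

```python
import math


def num_output_neurons(n_classes):
    """Bits needed to binary-encode n_classes labels, i.e. ceil(log2(n_classes))."""
    # (n - 1).bit_length() gives the same value without floating-point rounding.
    return max(1, (n_classes - 1).bit_length())


def encode(label, n_bits):
    """Class index -> list of target bits, most significant bit first."""
    return [(label >> shift) & 1 for shift in reversed(range(n_bits))]


def decode(sigmoid_outputs):
    """Threshold each sigmoid output at 0.5 and read the bits as a class index."""
    label = 0
    for p in sigmoid_outputs:
        label = (label << 1) | (1 if p >= 0.5 else 0)
    return label


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Sum of independent per-bit losses (one Bernoulli per output neuron)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(targets, probs))


n_bits = num_output_neurons(3)
codes = [encode(label, n_bits) for label in range(3)]
print(n_bits, codes, num_output_neurons(100))
print(binary_cross_entropy(encode(2, n_bits), [0.9, 0.2]))
```

Note that with 3 classes and 2 bits the code [1,1] is unused, so `decode([0.9, 0.9])` returns 3, which is not a valid class. A softmax layer cannot produce that kind of out-of-range prediction.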

Neural Network Developing

I am trying to write a neural network class, but I don't fully understand some aspects of it. I have two questions about the following design.
Am I doing this correctly? Does the bias neuron need to connect to all neurons (except those in the input layer) or just those in the hidden layer?
My second question is about calculating the output value. I'm using the equation below to calculate the output value of a neuron.
HiddenLayerFirstNeuron.Value =
(input1.Value * weight) + (input2.Value * weight) + (Bias.Value * weight)
After this equation, I calculate the activation and send the result to the output. The output neurons do the same.
I'm not sure what I am doing and I want to clear up these problems.
Take a look at http://deeplearning.net/tutorial/contents.html. It explains everything you need to know about multilayer perceptrons using Theano (a symbolic mathematics library).
The bias is usually connected to all hidden and output units.
Yes, you compute the input of the activation function as the sum, over the neurons of the previous layer, of weight * output of that neuron.
Good luck with development ;)
There should be a separate bias neuron for each hidden layer and for the output layer. Think of a layer as a function applied to a first-order polynomial, such as f(m*x+b)=y, where y is your output and f is your activation function. If you look at the linear term you will recognize the b. This represents the bias, and it behaves in a neural network much as it does in this simplification: it shifts the hyperplane up and down in the space. Keep in mind that you will have one bias unit per layer, with a constant value of 1, connected to all neurons of that layer, so that each neuron computes f(w1*x1+...+wn*xn+b), where b is the weight on that neuron's bias connection. When it comes to gradient descent, you will have to train this bias weight like a normal weight.
In my opinion you should apply the activation function to the output layer as well. This is how it's usually done with multilayer perceptrons, but it actually depends on what you want. If, for example, you use the logistic function as the activation function and you want an output in the interval (0,1), then you have to apply your activation function to the output as well, since a plain linear combination, as in your example, can in theory go outside the boundaries of that interval.
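Putting the answers above together, a forward pass might look like the following sketch. The 2-2-1 layout and all weight values are invented for illustration. Note that every connection has its own weight (the question's formula reuses the single name `weight`), and the bias input is the constant 1 with one trainable weight per neuron.

```python
import math


def sigmoid(z):
    """Logistic activation, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, bias_weights):
    """One fully connected layer.

    weights[j][i] connects input i to neuron j; bias_weights[j] is the weight
    on the constant bias input (value 1) for neuron j.
    """
    outputs = []
    for neuron_weights, bias_weight in zip(weights, bias_weights):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + 1.0 * bias_weight
        outputs.append(sigmoid(z))
    return outputs


# Hypothetical 2-2-1 network with made-up weights.
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]
hidden_bias_weights = [0.1, -0.2]
output_weights = [[1.2, -0.7]]
output_bias_weights = [0.05]

hidden = layer_forward([1.0, 0.0], hidden_weights, hidden_bias_weights)
output = layer_forward(hidden, output_weights, output_bias_weights)
print(hidden, output)
```

The same `layer_forward` is used for the hidden and the output layer, so the output also passes through the activation and stays inside (0,1).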

Maxout neurons: are the weights in the maxout function referring to 2 unique sets of weights?

I don't understand how maxout works and I suspect it's due to how I visualize the linear algebra multiplication. Basically, I'm under the impression that there are two sets of weights for the maxout function, both individually trained, and then only one is selected. But I suspect this may be wrong, since I don't see how two different sets of weights can be trained simultaneously in one feed-forward run of the network.
Also, if the two weights w1 and w2 in the function do not refer to two unique sets of weights, could there be more than two arguments passed to the maxout function, of which only the max is chosen?
Here is the maxout function I read:
max((w1.T.dot(X) + b1), (w2.T.dot(X) + b2))
Is there a mental representation I could use to visualize this better?
I know this is late but I am gonna answer anyway.
Here you can check out the video by the author of maxout networks Ian Goodfellow, and here is the URL for the slides used in the video.
Below is the screenshot of the definition of Maxout Networks:
So it turns out that you are absolutely correct. For each neuron, you create two sets of weights and two biases. And if you want more, you can create n sets of weights and n biases for each neuron and then select the one with the max value, and do the same for all the neurons in the layer.
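As a mental model, one maxout neuron with k sets of weights can be sketched like this in plain Python. The names `affine`, `maxout_unit` and `pieces`, and all numbers, are illustrative assumptions and not taken from the maxout paper or slides.

```python
def affine(weights, bias, x):
    """w . x + b for one set of weights."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def maxout_unit(x, pieces):
    """pieces is a list of (weights, bias) pairs; output the largest affine value."""
    return max(affine(w, b, x) for w, b in pieces)


# One neuron with three hypothetical (weights, bias) pieces and a 2D input.
x = [1.0, -2.0]
pieces = [([0.5, 0.5], 0.0), ([-1.0, 0.25], 0.1), ([0.0, -1.0], -0.5)]

values = [affine(w, b, x) for w, b in pieces]
output = maxout_unit(x, pieces)
print(values, output)
```

All pieces are evaluated in the same forward pass. For a given input, only the piece that wins the max receives a gradient, and different inputs can select different pieces, which is how all the sets of weights end up being trained.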

Questions about Q-Learning using Neural Networks

I have implemented Q-Learning as described in,
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
To approximate Q(S,A), I use a neural network structured as follows:
Activation: sigmoid
Inputs: number of inputs + 1 for the action neurons (all inputs scaled to 0-1)
Outputs: a single output, the Q-value
Hidden layers: N number of M hidden layers
Exploration method: random, 0 < rand() < propExplore
At each learning iteration using the following formula,
I calculate a Q-Target value then calculate an error using,
error = QTarget - LastQValueReturnedFromNN
and back propagate the error through the neural network.
Q1, Am I on the right track? I have seen some papers that implement a NN with one output neuron for each action.
Q2, My reward function returns a number between -1 and 1. Is it OK to return a number between -1 and 1 when the activation function is sigmoid (0, 1)?
Q3, From my understanding of this method, given enough training instances it should be guaranteed to find an optimal policy, right? When training for XOR, sometimes it learns it after 2k iterations and sometimes it won't learn even after 40k to 50k iterations.
Q1. It is more efficient if you put all action neurons in the output. A single forward pass will give you all the q-values for that state. In addition, the neural network will be able to generalize in a much better way.
Q2. Sigmoid is typically used for classification. While you can use sigmoid in other layers, I would not use it in the last one.
Q3. Well, Q-learning with neural networks is famous for not always converging. Have a look at DQN (DeepMind). They address two important issues. First, they decorrelate the training data by using experience replay (a replay memory); stochastic gradient descent doesn't like it when training data is given in order. Second, they bootstrap using old weights, which reduces non-stationarity.
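The two DQN ideas mentioned in Q3 can be sketched roughly as follows. This is a simplified illustration, not DeepMind's implementation: `ReplayBuffer`, `q_targets` and `frozen_network` are hypothetical names, the target is assumed to be the standard one-step Q-learning target, and the "network" is a stand-in function returning one Q-value per action.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks up the temporal order of the transitions.
        return rng.sample(list(self.transitions), batch_size)


def q_targets(batch, target_q_values, gamma=0.99):
    """One-step Q-learning targets, bootstrapped from a frozen (older) network.

    target_q_values(state) must return a list with one Q-value per action.
    """
    targets = []
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q_values(next_state))
        targets.append(reward + bootstrap)
    return targets


def frozen_network(state):
    # Stand-in for an older copy of the Q-network with two actions (made up).
    return [0.1 * state, 0.5]


buffer = ReplayBuffer(capacity=100)
buffer.add(state=0, action=1, reward=1.0, next_state=10, done=False)
buffer.add(state=10, action=0, reward=-1.0, next_state=11, done=True)

batch = list(buffer.transitions)  # fixed order here; use buffer.sample(...) in training
targets = q_targets(batch, frozen_network, gamma=0.9)
print(targets)
```

The targets here fall outside (0, 1), which a sigmoid output neuron cannot produce. That is consistent with the advice in Q2 not to use a sigmoid in the last layer.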

Neural network classifier

Suppose you have 2 classes in 1D space, A with 2 elements and B with 1 element, in any configuration. The task is to distinguish between the two classes, i.e. to classify them. If you can choose an arbitrary activation function, what is the minimal number of neurons that can solve this?
I think you always have to use at least two neurons, or am I wrong?
Your question is somewhat related to the classical XOR problem for perceptrons. Let us suppose for a moment that it's about a neural network with the specific activation function that a perceptron has, the binary threshold. Then the task turns into a 1D XOR problem, and you do indeed need 2 neurons in the hidden layer and 1 neuron in the output layer to solve it. But you mention that an arbitrary activation function can be chosen. In this case we can choose a radial basis function (RBF) network. If it is acceptable to denote class A as an output value greater than T and class B as an output value less than T, then 1 RBF neuron will suffice to distinguish the classes. If you want every class to have its own output (whose value can be treated as a probability measure of the input belonging to the corresponding class), then you need 2 RBF neurons.
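A quick sketch of the single-RBF-neuron idea, assuming a Gaussian basis function and made-up data points, center, width and threshold. Here the neuron is centered on the B point, so B is the class above the threshold; using 1 minus the output would swap the roles to match the convention in the answer.

```python
import math


def rbf_neuron(x, center, width):
    """Gaussian radial basis unit: 1 when x == center, decaying with distance."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


# Hypothetical 1D data in the hardest arrangement: B lies between the two A points.
class_a = [0.0, 2.0]
class_b = [1.0]

center, width, threshold = 1.0, 0.3, 0.5


def classify(x):
    """Label a point by comparing the single neuron's output to the threshold."""
    return "B" if rbf_neuron(x, center, width) > threshold else "A"


predictions = [classify(x) for x in class_a + class_b]
print(predictions)
```

A single threshold unit cannot separate this arrangement because it only splits the line at one point, while the RBF unit responds to distance from its center and so carves out an interval.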

Resources