Why do activation functions in neural networks take such small values? - machine-learning

After all, even if the activation function took values from -10 to 10, it seems to me this would make the network more flexible. Surely the problem cannot be merely the absence of a suitable formula. Please explain what I am missing.

The activation function of a particular node in a neural network takes as its input the weighted sum of the previous layer's outputs.
If the previous layer itself has an activation function, then this input is just a weighted sum of node values that have already been transformed by that activation function. If that activation function is a squashing function, such as the sigmoid, then every operand in the weighted sum has been squashed to a small number before being added up.
If you only have a couple of nodes in the previous layer, then the number being passed to the current node with an activation function will likely be small. However, if the number of nodes in the previous layer is large, then the number will not necessarily be small.
The input to an activation function in a neural network depends on:
the size of the previous layer
the activation function of the previous layer
the value of the weights connecting these layers
the values of the nodes in the previous layer
Therefore, the values passed to an activation function can really be anything.
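To make this concrete, here is a minimal NumPy sketch (the weight scale of 0.5 is an arbitrary assumption, not taken from the question) estimating the typical magnitude of the weighted sum fed into a node when the previous layer's outputs are sigmoid-squashed into (0, 1):

import numpy as np

rng = np.random.default_rng(0)

def preactivation_scale(n_prev, n_trials=1000):
    # Average |weighted sum| fed into one node, for a previous layer of
    # size n_prev whose outputs are sigmoid-squashed into (0, 1).
    totals = []
    for _ in range(n_trials):
        prev_activations = 1.0 / (1.0 + np.exp(-rng.normal(size=n_prev)))  # each in (0, 1)
        weights = rng.normal(scale=0.5, size=n_prev)                       # hypothetical weight scale
        totals.append(abs(weights @ prev_activations))
    return np.mean(totals)

for n in (2, 10, 100, 1000):
    print(n, round(preactivation_scale(n), 2))

With this assumed weight scale, the printed magnitudes grow roughly with the square root of the previous layer's size: a wide layer can feed large values into the next activation even though every individual operand is small.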

Related

Neural Network with Input - Relu - SoftMax - Cross Entropy Weights and Activations grow unbounded

I have implemented a neural network with 3 layers: input to a hidden layer with 30 neurons (ReLU activation) to a softmax output layer. I am using the cross-entropy cost function. No outside libraries are being used. This runs on the MNIST dataset, so 784 input neurons and 10 output neurons.
I have got about 96% accuracy with hyperbolic tangent as my hidden layer activation.
When I try to switch to ReLU activation, my activations grow very fast, which causes my weights to grow unbounded as well until everything blows up!
Is this a common problem to have when using ReLU activation?
I have tried L2 regularization with minimal success. I end up having to set the learning rate lower by a factor of ten compared to the tanh activation, and I have tried adjusting the weight decay rate accordingly, yet the best accuracy I have gotten is about 90%. The rate of weight decay is still outpaced in the end by the updates to certain weights in the network, which leads to an explosion.
It seems everyone is just replacing their activation functions with ReLU and experiencing better results, so I keep looking for bugs and validating my implementation.
Is there more that goes into using ReLU as an activation function? Maybe I have problems in my implementation; can someone validate accuracy with the same neural net structure?
As you can see, the ReLU function is unbounded for positive values, which allows the weights to grow.
In fact, that is why hyperbolic tangent and similar functions are used in these cases: to bound the output to a certain range (-1 to 1 or 0 to 1 in most cases).
There is another approach to deal with this phenomenon, called weight decay.
The basic motivation is to get a more generalised model (avoid overfitting) and to make sure the weights don't blow up: you apply a regularization penalty that depends on the weight itself when updating it,
meaning that bigger weights get a bigger penalty.
You can read further about it here.
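A minimal sketch of the weight-decay idea described above, assuming plain SGD with hypothetical learning-rate and decay values: the penalty added to the gradient is proportional to the weight itself, so bigger weights are pulled back toward zero harder.

import numpy as np

def sgd_step_with_weight_decay(weights, grads, lr=0.01, weight_decay=1e-4):
    # One plain-SGD update with an L2 (weight decay) penalty added to the gradient.
    return weights - lr * (grads + weight_decay * weights)

# Toy illustration: with a zero gradient, the decay term alone shrinks the weights,
# and the larger a weight is, the larger its step back toward zero.
w = np.array([5.0, -3.0, 0.1])
for _ in range(3):
    w = sgd_step_with_weight_decay(w, grads=np.zeros_like(w), lr=0.1, weight_decay=0.5)
print(w)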

Total weight of inputs to a neuron in ANN

In ANN, we know that to make it "learn", we need to adjust the weights of the inputs to a particular neuron.
total_input(i) = Σ_j w(j,i) · a(j)
During adjustment, some weights need to be reduced while others need to be increased.
Should the total weight of all j inputs to the i-th neuron sum to 1?
There's absolutely no reason for the weights in the linear layer (a.k.a. dense or fully-connected layer) to sum up to anything specific, such as 1.0. They are usually initialized with small random numbers (so initial sum is unlikely to be 1.0) and then get tweaked somehow (not completely independently, but at least differently).
If the neural network doesn't use any regularization, it's often possible to train the network to large weight values, much larger than 1.0 (see also this question).
There are particular cases where an analogous condition does hold, for example the softmax layer, which mathematically guarantees that the sum of its outputs is 1.0. But the linear layer doesn't guarantee anything like that.
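A small NumPy sketch of the contrast drawn above (the layer sizes and the weight scale are arbitrary): randomly initialized linear-layer weights sum to no particular value, while a softmax layer's outputs are guaranteed to sum to 1.0.

import numpy as np

rng = np.random.default_rng(0)

# A randomly initialized linear (dense) layer: 5 inputs feeding one neuron.
w = rng.normal(scale=0.1, size=5)
print(w.sum())            # some arbitrary value, almost never exactly 1.0

# A softmax layer, by contrast, guarantees outputs summing to 1.0.
def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(rng.normal(size=3)).sum())  # 1.0 (up to floating-point error)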

Neural network, minimum number of neurons

I've got a 2D surface where a ship (with constant speed) navigates around the scene to pick up candy. For every candy the ship picks up, I increase the fitness. The NN has one output to steer the ship (0 for left and 1 for right, so 0.5 would be straight ahead). There are four inputs in the range [-1..1], which represent two normalized vectors: the ship's direction and the direction to the piece of candy.
Is there any way to calculate the minimum number of neurons in the hidden layer? I also tried giving two inputs instead of four, the first was the dot product [-1..1] (where I dotted the ship direction with the direction to the candy) and the second was (0/1) if the candy was to the left/right of the ship. It seems like this approach worked a lot better with fewer neurons in the hidden layer.
Fewer inputs should imply fewer neurons. This is because the number of input combinations decreases and it gets easier for the neural network to learn the system. There is no golden rule for calculating the best number of nodes in the hidden layer. However, with 2 inputs I'd say 2 hidden nodes should work fine. It really depends on the degree of non-linearity in your inputs.
Defining the number of hidden layers and the number of neurons in each hidden layer has always been a challenge, and the answer varies from one type of problem to another. That said, a single hidden layer in a feedforward neural network can solve most problems, given that it can approximate functions.
Murata defined some rules of thumb for choosing the number of hidden neurons in a feedforward neural network:
The value should be between the size of the input and output layers.
The value should be 2/3 the size of the input layer plus the size of the output layer.
The value should be less than twice the size of the input layer.
You could try these rules and evaluate their impact on your neural network.
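If it helps to see those rules of thumb concretely, here is a tiny, purely illustrative Python sketch (the function name is made up) that evaluates them for a given number of inputs and outputs:

def hidden_size_heuristics(n_in, n_out):
    # Candidate hidden-layer sizes implied by the three rules above;
    # these are starting points to try, not guarantees of good performance.
    return {
        "between input and output size": sorted((n_out, n_in)),
        "2/3 * input size + output size": round(2 / 3 * n_in + n_out),
        "upper bound (< 2 * input size)": 2 * n_in - 1,
    }

# For the ship example with 2 inputs and 1 output:
print(hidden_size_heuristics(2, 1))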

How to use the custom neural network function in the MATLAB Neural Network Toolbox

I'm trying to create the neural network shown below. It has 3 inputs, 2 outputs, and 2 hidden layers (so 4 layers altogether, or 3 layers of weight matrices). In the first hidden layer there are 4 neurons, and in the second hidden layer there are 3. There is a bias neuron going to the first and second hidden layer, and the output layer.
I have tried using the "create custom neural network" function in MATLAB, but I can't get it to work how I want it to.
This is how I used the function
net1=network(3,3,[1;1;1],[1,1,1;0,0,0;0,0,0],[0,0,0;1,0,0;0,1,0],[0,0,0])
view(net1)
And it gives me the neural network shown below:
As you can see, this isn't what I want. There are only 3 weights in the first layer, 1 in the second, 1 in the output layer, and only one output. How would I fix this?
Thanks!
Just to clarify how I want this network to work:
The user will input 3 numbers into the network.
Each one of the 3 inputs is multiplied by 4 different weights, and then these numbers are sent to the 4 neurons in the first hidden layer.
The bias node acts the same as one of the inputs, but it always has a value of 1. It is multiplied by 4 different weights, and then sent to the 4 neurons in the first hidden layer.
Each neuron in the first hidden layer sums the 4 numbers going into it, and then passes this number through the sigmoid activation function.
The neurons in the first hidden layer then output 4 numbers that are each multiplied by 3 different weights, and sent to the 3 neurons in the second hidden layer.
The bias node going to the second hidden layer works the same as the first bias node
Each neuron in the second hidden layer sums up the 5 numbers going into it and passes the result through the sigmoid activation function.
The neurons in the second hidden layer then output three numbers that are again multiplied by weights and go to each of the two outputs.
The output layer also sums all of its inputs, including its bias input, and then passes this through the sigmoid activation function to get the final two values.
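For reference, the forward pass described in the steps above can be written out directly. This NumPy sketch, with random stand-in values for the trained weights and biases, mirrors the 3-4-3-2 layout with sigmoid units and a bias feeding each layer:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random stand-ins for the trained parameters of the 3-4-3-2 network.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # inputs -> first hidden layer (4 neurons)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)   # first -> second hidden layer (3 neurons)
W3, b3 = rng.normal(size=(2, 3)), rng.normal(size=2)   # second hidden layer -> 2 outputs

def forward(x):
    h1 = sigmoid(W1 @ x + b1)     # each neuron: weighted sum + bias, then sigmoid
    h2 = sigmoid(W2 @ h1 + b2)
    return sigmoid(W3 @ h2 + b3)

print(forward(np.array([0.2, 0.5, 0.9])))  # two values in (0, 1)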
After some time playing around I've figured out how to do it. The code I needed to use is:
net = newff([0 1; 0 1; 0 1],[4,3,2],{'logsig','logsig','logsig'})
view(net)
This creates the network I was looking for.
I was originally mistaken about the MATLAB representation of neural networks. The green arrows show the path of all of the numbers, not just a single number.

In 3 layer MLP, why should the input to hidden weights be random?

For example, in a 3-1-1 network, if the weights are initialized equally the MLP might not learn well. But why does this happen?
If you only have one neuron in the hidden layer, it doesn't matter. But imagine a network with two neurons in the hidden layer. If they have the same weights for their inputs, then both neurons would always have exactly the same activation; there is no additional information gained by having a second neuron. And in the backpropagation step, those weights would change by an equal amount. Hence, in every iteration, those hidden neurons have the same activation.
It looks like you have a typo in your question title. I'm guessing that you mean why should the weights of hidden layer be random. For the example network you indicate (3-1-1), it won't matter because you only have a single unit in the hidden layer. However, if you had multiple units in the hidden layer of a fully connected network (e.g., 3-2-1) you should randomize the weights because otherwise, all of the weights to the hidden layer will be updated identically. That is not what you want because each hidden layer unit would be producing the same hyperplane, which is no different than just having a single unit in that layer.
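A small NumPy sketch of the symmetry problem both answers describe, for a 3-2-1 network with every input-to-hidden weight initialized to the same (arbitrary) value: the backpropagated gradient rows for the two hidden units come out identical, so the units can never become different from one another.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 3-2-1 network with both hidden units initialized identically.
x = np.array([0.5, -0.2, 0.8])
W1 = np.full((2, 3), 0.1)       # identical rows: the symmetric initialization
w2 = np.array([0.3, 0.3])
y_target = 1.0

h = sigmoid(W1 @ x)             # both hidden activations are identical
y = sigmoid(w2 @ h)

# Backpropagated gradient for the input-to-hidden weights (squared-error loss).
dy = (y - y_target) * y * (1 - y)
dh = dy * w2 * h * (1 - h)
dW1 = np.outer(dh, x)
print(dW1)                      # both rows are identical, so after the update
                                # the two hidden units remain clones of each other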
