How to calculate the number of connections in a neural network - machine-learning

I have exactly this scenario and I need to know how many connections this network has. I searched in several places and I'm not sure of the answer; that is, I do not know how to calculate the number of connections in my network, and this is still unclear to me. What I have is exactly as follows:
(Bias in all layers except the input.)
Input: 784
First hidden layer - Output: 400
Second hidden layer - Output: 200
Output layer - Output: 10
I would calculate this as follows: ((784 * 400) + bias) + ((400 * 200) + bias) + ((200 * 10) + bias) = XXX
I do not know if this is correct. I need help figuring out how to solve this, and if it's not just simple arithmetic, what is the theory behind this calculation?
Thank you.

Your calculation is correct for the total number of weights. When you have n neurons connected to m neurons, the number of connections between them is n*m. You can see this by drawing a small graph, say 3 neurons connected to 4 neurons: there are 12 connections between the two layers. So if you want connections rather than weights, just drop the '+ bias' parts of your equation.
If you want total weights, then the number is simply n*m + m, since you get one bias weight for each of the m neurons in the second layer.
Total connections in that neural network: (784*400)+(400*200)+(200*10)
Total weights in that neural network: (784*400+400)+(400*200+200)+(200*10+10)
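To sanity-check the arithmetic, here is a minimal Python sketch that computes both totals from the layer sizes in the question:

layer_sizes = [784, 400, 200, 10]

# Connections: n*m for each pair of consecutive layers.
connections = sum(n * m for n, m in zip(layer_sizes, layer_sizes[1:]))

# Weights: n*m plus one bias weight per neuron in the receiving layer.
weights = sum(n * m + m for n, m in zip(layer_sizes, layer_sizes[1:]))

print(connections)  # 395600
print(weights)      # 396210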

Related

Wouldn't dropout layer ruin the network?

Imagine that you have a deep neural network on a regression problem to predict weight of a person. Your neural network looks like this:
Dense(112, activation='relu')
Dense(512, activation='relu')
Dropout(0.5)
Dense(1, activation='relu')
Now during training, let's say that the 50% of nodes that are active correspond to an output of around 30 to 100, depending on the input. During testing, when dropout is not applied, won't that output roughly double? Previously, when just 50% of the nodes were active, only that half was passing values to the output node, and we were getting an output of around 30 to 100. During testing, all of the nodes are active, so all of them pass values to the output node. So if 50% of the nodes produced an output of around 30 to 100, won't this value be doubled during testing, when 100% of the nodes are active?
As Dr. Snoopy pointed out in the comments, dropout includes a scaling step (multiplication by p) that prevents exactly the problem I am asking about.
If a unit is retained with probability p during training, then at test time all of the outgoing weights of that unit are first multiplied by p.
So, in the weight-prediction case I mentioned in the question, all the outgoing weights after the dropout layer are first multiplied by p (0.5 in this case), making the test output the same as the training output, around 30 to 100, and fixing the problem!
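To make the expectation argument concrete, here is a small NumPy sketch; the activations and weights are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                          # retention probability
h = rng.uniform(0.0, 1.0, 512)   # hypothetical hidden-layer activations
w = rng.normal(size=512)         # hypothetical outgoing weights to the output node

# Training time: a random half of the units is dropped (zeroed out).
mask = rng.random(512) < p
train_out = (h * mask) @ w

# Test time (as in the dropout paper): all units active, outgoing weights scaled by p.
test_out = h @ (w * p)

# In expectation the two match, so the test output does not double.
print(train_out, test_out)

Note that Keras' Dropout layer actually implements "inverted dropout": it scales the retained activations by 1/(1 - rate) during training and does nothing at test time, which achieves the same effect.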

Is there any regularizer that penalizes redundant neurons?

I trained a network on real-valued labels (floating-point numbers from 0.0 to 1.0). It starts with several residual blocks, and the last layers are
fully-connected layer with 64 neurons + ELU activation,
fully-connected layer with 16 neurons + ELU activation,
output logistic regression layer (1 neuron with y = 1 / (1 + exp(-x))).
After training, I visualised weights of the layer with 16 neurons:
the figure's rows represent the weights that each one of the 16 neurons developed for each one of the 64 neurons of the previous layer; the indices are 0..15 and 0..63;
UPD: a figure shows the correlation (Pearson) between the neurons' weights;
UPD: a figure shows the MAD (mean absolute difference) between the neurons' weights - this demonstrates the redundancy even better than correlation.
Now the detailed questions:
Can we say that there are redundant features? I see several redundant groups of neurons: 0,4; 1,6,7 (maybe 8,11,15 too); 2,14; 12,13 (maybe).
Is that bad?
If so, is there any regularizer that penalizes redundant neuron weights and makes neurons develop uncorrelated weights?
I use the Adam optimizer, Xavier initialization (the best of those tested) and a weight decay of 1e-5 per batch (the best of those tested); other output layers did not work as well as logistic regression (in terms of precision, recall and lack of overfitting).
I use only 10 filters in each ResNet block (there are 10 blocks, too) to address overfitting.
Are you using TensorFlow? If yes, is post-training quantization an option?
tensorflow.org/lite/performance/post_training_quantization
This has a similar effect to what you need, but also brings other improvements.
Alternatively, you could also try quantization-aware training:
https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize
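For reference, the basic post-training quantization flow from that page looks roughly like this (a sketch using a toy stand-in model, since the original architecture isn't fully specified):

import tensorflow as tf

# Toy stand-in for the trained network described in the question.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="elu"),
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Post-training quantization via the TFLite converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)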

Softmax layer and last layer of neural net

I have a doubt: suppose the last layer before the softmax layer has 1000 nodes and I have only 10 classes to classify. How does the softmax layer, which should output 1000 probabilities, output only 10 probabilities?
The output of the 1000-node layer will be the input to the 10-node layer. Basically,
x_10 = w^T * y_1000
The weight matrix w has to be of size 1000 x 10. Now, the softmax function is applied to x_10 to produce the probability output for the 10 classes.
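As a concrete illustration, here is a NumPy sketch with random values standing in for the real activations and weights:

import numpy as np

rng = np.random.default_rng(0)
y_1000 = rng.normal(size=1000)          # output of the 1000-node layer
W = rng.normal(size=(1000, 10)) * 0.01  # weight matrix of size 1000 x 10
b = np.zeros(10)

x_10 = y_1000 @ W + b              # linear combination: 10 raw scores (logits)
probs = np.exp(x_10 - x_10.max())  # softmax, shifted for numerical stability
probs /= probs.sum()

print(probs.shape)  # (10,)
print(probs.sum())  # 1.0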
Your understanding is off! The network will output 10 probabilities for EACH example; softmax is an ACTIVATION function! It takes the linear combination of the previous layer (determined by the incoming and outgoing weights) and, no matter what, outputs a number of probabilities equal to the number of classes! If you add more details, for example what your neural network looks like, we can help you further and explain in a lot more depth so you understand what's going on!

How to update batch weights in back propagation

How does the gradient descent algorithm update the batch weights in the backpropagation method?
Thanks in advance.
It is really easy once you understand the algorithm.
New Weights = Old Weights - learning-rate x Partial derivatives of loss function w.r.t. parameters
Let's consider a neural network with two inputs, two hidden neurons, two output neurons.
First, introduce weights and biases to your network. Then, compute the total net input for the hidden layer, like this:
net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1
Do the same for the other hidden neurons, and then for the output neurons.
Next, calculate the error for each output neuron using the squared error function and sum them to get the total error.
After that, you calculate the partial derivative of the total network error with respect to each of the earlier weights, to find out how each weight affects the network. I have included a visual to help with your understanding.
I strongly suggest you go through this beginner-friendly introduction to back-propagation to get a firm grasp of the concept. I hope my beginner post helps you get started on your journey into Machine Learning!
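Put together, a single update step for that tiny 2-2-2 network could look like this in NumPy (a sketch with made-up inputs, targets and sigmoid activations):

import numpy as np

rng = np.random.default_rng(0)

# Tiny network from the answer: 2 inputs -> 2 hidden neurons -> 2 output neurons.
i = np.array([0.05, 0.10])   # example inputs (made up)
t = np.array([0.01, 0.99])   # example targets (made up)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass: net inputs plus bias, then activation.
h = sigmoid(i @ W1 + b1)
o = sigmoid(h @ W2 + b2)
loss = 0.5 * np.sum((t - o) ** 2)   # total squared error

# Backward pass: partial derivatives of the loss w.r.t. each weight matrix.
delta_o = (o - t) * o * (1 - o)
delta_h = (delta_o @ W2.T) * h * (1 - h)

# New weights = old weights - learning rate * dLoss/dWeights
W2 -= lr * np.outer(h, delta_o)
b2 -= lr * delta_o
W1 -= lr * np.outer(i, delta_h)
b1 -= lr * delta_h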

Torch - Large input and output sizes into neural network

I'm new to machine learning and seek some help.
I would like to train a network to predict the next values I expect, as follows:
reference: [val1 val2 ... val 15]
val = 0 if it doesn't exist, 1 if it does.
Input: [1 1 1 0 0 0 0 0 1 1 1 0 0 0 0]
Output: [1 1 1 0 0 0 0 0 1 1 1 0 0 1 1] (last two values appear)
So my neural network would have 15 inputs and 15 outputs
I would like to know if there would be a better way to do that kind of prediction. Would my data also need normalization?
Now the problem is that I don't have 15 values, but actually 600,000 of them. Can a neural network handle such big tensors? I have also heard that I would need twice that number of hidden-layer units.
Thanks a lot for your help, you machine learning experts!
Best
This is not a problem for the concept of a neural network: the question is whether your computing configuration and framework implementation deliver the required memory. Since you haven't described your topology, there's not a lot we can do to help you scope this out. What do you have for parameter and weight counts? Each of those is at least a short float (4 bytes). For instance, a direct FC (fully-connected) layer would give you (6e5)^2 weights, or 3.6e11 * 4 bytes => 1.44e12 bytes. Yes, that's pushing 1.5 terabytes for that layer's weights.
You can get around some of this with the style of NN you choose. For instance, splitting into separate channels (say, 60 channels of 1000 features each) can give you significant memory savings, albeit at the cost of speed in training (more layers) and perhaps some accuracy (although crossover can fix a lot of that). Convolutions can also save you overall memory, again at the cost of training speed.
600K => 4 => 600K
That clarification takes care of my main worries: you have 600,000 * 4 weights in each of two places, i.e. 4.8M weights, plus 1,200,004 neuron values (600K + 4 + 600K). That's roughly 6M floats in total, which shouldn't stress the RAM of any modern general-purpose computer.
The channelling idea is for when you're trying to have a fatter connection between layers, such as a 600K => 600K FC. In that case, you break the data up into smaller groups (usually just 2-12) and make a bunch of parallel fully-connected streams. For instance, you could take your input and make 10 streams, each of which is a 60K => 60K FC. In the next layer, you swap the organization, "dealing out" each set of 60K so that 1/10 goes into each of the next channels.
This way, you have only 10 * 60K * 60K weights, only 10% as many as before ... but now there are 3 layers. Still, it's a 5x saving on memory required for weights, which is where you have the combinatorial explosion.
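A rough PyTorch sketch of that channelling idea, scaled down so it actually runs (the 600-feature / 10-stream sizes are stand-ins for the real 600K / 60K):

import torch
import torch.nn as nn

class ChanneledFC(nn.Module):
    # Split the input into `groups` chunks, apply a separate FC layer to each,
    # then interleave ("deal out") the chunks so the next layer mixes all streams.
    def __init__(self, features, groups):
        super().__init__()
        assert features % groups == 0
        self.groups = groups
        chunk = features // groups
        self.streams = nn.ModuleList(nn.Linear(chunk, chunk) for _ in range(groups))

    def forward(self, x):                     # x: (batch, features)
        chunks = x.chunk(self.groups, dim=1)  # `groups` slices of size features/groups
        out = [stream(c) for stream, c in zip(self.streams, chunks)]
        y = torch.stack(out, dim=1)           # (batch, groups, chunk)
        # Transposing the group and within-group dimensions before flattening
        # "deals out" each stream's outputs across the next layer's channels.
        return y.transpose(1, 2).reshape(x.shape[0], -1)

net = nn.Sequential(ChanneledFC(600, 10), ChanneledFC(600, 10))
print(net(torch.randn(8, 600)).shape)  # torch.Size([8, 600])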
