I have a doubt: suppose the last layer before the softmax layer has 1000 nodes and I have only 10 classes to classify. How does the softmax layer, which should output 1000 probabilities, output only 10 probabilities?
The output of the 1000-node layer will be the input to the 10-node layer. Basically,
x_10 = w^T * y_1000
The w has to be of size 1000 x 10, so x_10 has 10 entries. The softmax function is then applied to x_10 to produce the probability output for the 10 classes.
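As a minimal sketch (Keras; the sizes and variable names here are just illustrative), the step from 1000 features to 10 class probabilities is simply one more Dense layer:

import tensorflow as tf

# Hypothetical sizes from the question: 1000 features in, 10 classes out.
inputs = tf.keras.Input(shape=(1000,))
logits = tf.keras.layers.Dense(10)(inputs)              # the weight matrix w has shape (1000, 10)
probs = tf.keras.layers.Activation('softmax')(logits)   # 10 probabilities summing to 1
model = tf.keras.Model(inputs, probs)
print(model.output_shape)                                # (None, 10)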
You're wrong in your understanding! The softmax layer will output 10 probabilities for EACH example - softmax is an ACTIVATION function! It takes the linear combination of the previous layer's outputs (determined by the incoming weights) and, no matter what, outputs a number of probabilities equal to the number of classes! If you can add more details, like an example of what your neural network looks like, we can help you further and explain in a lot more depth so you understand what's going on!
Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-class classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact work as a drop-in replacement for per-example training: computing all inputs this way is essentially running the network with a batch size equal to the total number of inputs. The weights do not get updated between inputs, but all at once after the whole set. So while the matrix math itself is valid, each input no longer individually influences the training step by step, which changes how the network learns. However, with a reasonable batch size you can do these 2D matrix multiplications, where the input is batch_size by input_size, in order to speed up training.
In addition, when predicting on many inputs (in the test stage, for example), since no weights are updated, a single matrix multiplication of num_inputs by input_size can be run to compute all outputs in parallel.
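As a rough sketch of what that looks like (NumPy, with made-up sizes and a mean-squared-error gradient purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, output_size = 100, 10, 2
learning_rate = 0.01

x = rng.normal(size=(batch_size, input_size))            # 100x10 batch of inputs
y = rng.normal(size=(batch_size, output_size))           # 100x2 targets (placeholder)
weights = rng.normal(size=(input_size, output_size))     # 10x2

# Forward pass for the whole batch at once
output = x @ weights                                      # 100x2

# Backward pass (gradient of a mean squared error, as an example)
output_err = 2.0 * (output - y) / batch_size              # 100x2
input_err = output_err @ weights.T                        # 100x10, passed to earlier layers
weights_err = x.T @ output_err                            # 10x2
weights -= learning_rate * weights_err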
I am building a simple multi-layer perceptron that takes an image as input and gives as output the classification of the image. My image dataset is composed of grayscale images of size (n x m). I choose n*m input neurons as the input layer (in reality I am reducing dimensions with PCA, but let's keep it simple). Then I choose some intermediate hidden layers. Now, what should I choose as my output layer? How many neurons, and why? My classification uses, say, L different classes (i.e., L different types of images). Should I use a single output neuron?
Since you have L different classes, you should have L output neurons; in Keras it would be:
...
previous_layer = tf.keras.layers.Dense(4096)(...)
output = tf.keras.layers.Dense(self.nb_class)(previous_layer)
If you were doing binary classification you would need a sigmoid activation:
output = tf.keras.layers.Activation('sigmoid')(output)
If L > 2 then you would go for softmax activation.
output = tf.keras.layers.Activation('softmax')(output)
Last thing: you should try some convolutional layers before going for Dense layers. Look at the VGG16 architecture.
If you want to work with just a single model (a multi-layer NN):
If L = 2,
the number of neurons in the last layer can be just one, with sigmoid activation (the most common approach).
You can also avoid the sigmoid and simply apply a threshold to do the binary classification.
If L > 2,
the number of neurons should be L, with softmax activation.
A special case is multi-label classification, in which a single sample can belong to several classes at once.
Then, use L neurons with sigmoid activation.
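To make that concrete, here is a minimal Keras sketch of the three output-layer choices (the value of L and the variable names are made up for illustration):

import tensorflow as tf

L = 5                                                    # hypothetical number of classes

# L = 2: a single sigmoid neuron, trained with binary cross-entropy
binary_out = tf.keras.layers.Dense(1, activation='sigmoid')

# L > 2, single-label: L neurons with softmax, trained with categorical cross-entropy
multiclass_out = tf.keras.layers.Dense(L, activation='softmax')

# Multi-label: L independent sigmoid neurons, trained with binary cross-entropy per label
multilabel_out = tf.keras.layers.Dense(L, activation='sigmoid')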
I have a binary semantic segmentation problem and there are 2 methods in my mind.
Method 1:
The Unet outputs one channel (a single class) with sigmoid activation, then I use the dice loss to calculate the loss.
Method 2:
The ground truth is concatenated with its inverse, giving 2 classes. The output of the Unet is 2 channels, with softmax activation applied to them. The dice loss is then used to calculate the loss.
Which is correct?
This question has been answered here. If you have a 2-class problem, output only 1 channel and use a sigmoid function (it outputs values between 0 and 1). Then you can calculate your dice loss with the output (continuous values) and the target (single channel, discrete values). If your network outputs 2 channels, use a softmax function and calculate your loss with your output (continuous values) and target (2-channel one-hot encoded). The former is preferred, as you will have fewer parameters.
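For reference, a rough sketch of the single-channel dice loss (NumPy-style; the function name and the smoothing constant are just illustrative):

import numpy as np

def dice_loss(pred, target, smooth=1.0):
    # pred: sigmoid outputs in [0, 1]; target: binary mask; both of shape (H, W)
    pred = pred.reshape(-1)
    target = target.reshape(-1)
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + smooth) / (np.sum(pred) + np.sum(target) + smooth)
    return 1.0 - dice

# The 2-channel softmax variant would compute the same quantity per channel and average.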
Method 2 is correct, since softmax is used for multi-class problems.
I trained a network on real-valued labels (floating point numbers from 0.0 to 1.0) - several residual blocks in the beginning, and the last layers are
fully-connected layer with 64 neurons + ELU activation,
fully-connected layer with 16 neurons + ELU activation,
an output logistic regression layer (1 neuron with y = 1 / (1 + exp(-x))).
After training, I visualised weights of the layer with 16 neurons:
the figure's rows represent the weights that each of the 16 neurons developed for each of the 64 neurons of the previous layer; indices are 0..15 and 0..63;
UPD: the figure shows the correlation (Pearson) between the neurons' weight vectors;
UPD: the figure shows the MAD (mean absolute difference) between the neurons' weight vectors - this demonstrates the redundancy even better than correlation.
Now the detailed questions:
Can we say that there are redundant features? I see several redundant groups of neurons: 0,4; 1,6,7 (maybe 8,11,15 too); 2,14; 12,13 (maybe) .
Is it bad?
If so, is there a regularizer that penalizes redundant neuron weights and makes neurons develop uncorrelated weights?
I use the Adam optimizer, Xavier initialization (the best of those tested), and weight decay of 1e-5 per batch (the best of those tested); other output layers did not work as well as logistic regression (in terms of precision & recall & lack of overfitting).
I use only 10 filters in each ResNet block (there are 10 blocks, too) to address overfitting.
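For concreteness, the kind of penalty I have in mind would look roughly like this Keras sketch (the class name and the strength value are made up; it penalizes the correlation between the weight vectors of different neurons in a Dense layer):

import tensorflow as tf

class DecorrelationRegularizer(tf.keras.regularizers.Regularizer):
    # Hypothetical penalty on pairwise similarity between the weight vectors of a layer's neurons.
    def __init__(self, strength=1e-3):
        self.strength = strength

    def __call__(self, kernel):
        # kernel has shape (inputs, units); each column is one neuron's weight vector
        w = kernel - tf.reduce_mean(kernel, axis=0, keepdims=True)
        w = tf.math.l2_normalize(w, axis=0)
        sim = tf.matmul(w, w, transpose_a=True)                    # (units, units) correlations
        off_diag = sim - tf.linalg.diag(tf.linalg.diag_part(sim))
        return self.strength * tf.reduce_sum(tf.square(off_diag))

    def get_config(self):
        return {"strength": self.strength}

# e.g. tf.keras.layers.Dense(16, activation='elu', kernel_regularizer=DecorrelationRegularizer(1e-3))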
Are you using TensorFlow? If yes, is post-training quantization an option?
tensorflow.org/lite/performance/post_training_quantization
This has a somewhat similar effect to what you need, but it also makes other improvements.
Alternatively, you could also try quantization-aware training:
https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize
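A rough sketch of the post-training route (assuming the model has already been exported as a SavedModel; the file paths are placeholders):

import tensorflow as tf

# Dynamic-range post-training quantization with the TFLite converter
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)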
I am doing binary classification using a deep neural network. Whenever I use binary_crossentropy my model does not give good accuracy (it is close to random prediction). But if I use categorical_crossentropy by making the size of the output layer 2, I get good accuracy in only 1 epoch, close to 0.90. Can anyone please explain what is happening here?
I also had this problem while trying to use binary_crossentropy with a softmax activation in the output layer. As far as I know, softmax gives the probability of each class, so if your output layer has 2 nodes it will output something like p(x1) and p(x2) with p(x1) + p(x2) = 1. Therefore, if you have only 1 output node with softmax, it will always be equal to 1.0 (100%); that's why you get close to random predictions (honestly, they will be close to your category distribution in the evaluation set).
Try changing it to another activation method, like sigmoid.
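As a small sketch of the two setups that do train properly (Keras; the hidden-layer size and input size are made up), the failure mode described above comes specifically from combining softmax with a single output node:

import tensorflow as tf

# Option A: 1 output node, sigmoid, binary_crossentropy (labels are 0/1)
model_a = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                         # hypothetical input size
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model_a.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Option B: 2 output nodes, softmax, categorical_crossentropy (labels are one-hot)
model_b = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model_b.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])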