Why does softmax loss become 87.3365 in Caffe training? - machine-learning

I can only figure out that this relates to FLT_MIN.
With single precision floating-point, FLT_MIN=2^(-126), ln(FLT_MIN)=-87.33654475055310898657124730373.
From the definition of caffe::SoftmaxWithLossLayer
(http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1SoftmaxWithLossLayer.html),
if loss = 87.3365, this suggests that the softmax probability of the correct class is FLT_MIN.
Why does this happen?

Well, recall that single-precision FLT_MIN = 1.175494350822288e-38 (2^-126).
The softmax loss is the negative natural logarithm of the predicted probability, and ln(1.175494350822288e-38) = -87.336544750553109, i.e. approximately 87.3365 in absolute value.
Which means the net is not learning anything at all.
Hint: in my case, I was setting a wrong number of outputs in the InnerProduct layer.
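
A quick way to reproduce the number (a standalone NumPy check, not Caffe code; the layer clamps the predicted probability of the true class to at least FLT_MIN before taking the negative log, so this is the largest per-example loss it can report):

import numpy as np

# Smallest normalized positive float32, i.e. FLT_MIN = 2**-126
flt_min = np.finfo(np.float32).tiny      # 1.1754944e-38

# The reported loss saturates at -ln(FLT_MIN)
print(-np.log(flt_min))                  # ~87.3365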

Related

What method is the correct way of implementing dice loss? Sigmoid or softmax?

I have a binary semantic segmentation problem and there are two methods in my mind.
Method 1:
Unet outputs one channel with sigmoid activation, then I use the dice loss to calculate the loss.
Method 2:
The ground truth is concatenated with its inverse, thus having 2 classes. The output of Unet is 2 channels with softmax activation applied to them. The dice loss is then used to calculate the loss.
Which is correct?
This question has been answered here. If you have a 2-class problem, output only 1 channel and use a sigmoid function (outputs values between 0 and 1). Then you can calculate your dice loss with the output (continuous values) and the target (single-channel, one-hot-encoded, discrete values). If your network outputs 2 channels, use a softmax function and calculate your loss with your output (continuous values) and target (2-channel, one-hot-encoded). The former is preferred, as you will have fewer parameters.
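
For concreteness, here is a minimal sketch of the single-channel (sigmoid) variant recommended above, written for TensorFlow/Keras tensors; the function name, epsilon and tensor shapes are illustrative assumptions, not code from the question:

import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    # y_true: binary ground-truth mask, shape (batch, H, W, 1)
    # y_pred: sigmoid probabilities, same shape, values in (0, 1)
    axes = [1, 2, 3]                      # reduce over spatial dims and the channel
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    union = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - tf.reduce_mean(dice)     # 0 when prediction and mask overlap perfectly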
Method 2 is correct, since softmax is used for multi-class problems.

Weight initialization in neural networks

Hi, I am developing a neural network model using Keras.
code
from keras.models import Sequential
from keras.layers import Dense

def base_model():
    # Initialising the ANN
    regressor = Sequential()
    # Adding the input layer and the first hidden layer
    regressor.add(Dense(units = 4, kernel_initializer = 'he_normal', activation = 'relu', input_dim = 7))
    # Adding the second hidden layer
    regressor.add(Dense(units = 2, kernel_initializer = 'he_normal', activation = 'relu'))
    # Adding the output layer
    regressor.add(Dense(units = 1, kernel_initializer = 'he_normal'))
    # Compiling the ANN
    regressor.compile(optimizer = 'adam', loss = 'mse', metrics = ['mae'])
    return regressor
I have been reading about which kernel_initializer to use and came across this link: https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404
It talks about Glorot and He initializations. I have tried different initializations for the weights, but all of them give the same results. I want to understand how important it is to do a proper initialization.
Thanks
I'll give you an explanation of why weight initialisation matters so much.
Let's suppose our NN has an input layer with 1000 neurons, and that we initialise the weights from a normal distribution with mean 0 and variance 1, i.e. N(0, 1).
At the second layer, assume that only 500 of the first-layer neurons are activated (output 1), while the other 500 are not (output 0).
The weighted input of a second-layer neuron, z = sum_j w_j x_j + b, will then be a sum of roughly 500 independent N(0, 1) terms, so z is itself normally distributed, but with variance about 500 (standard deviation ~ 22).
This means it is very likely that z >> 1 or z << -1, so sigmoid-like neurons will saturate and the network will learn very slowly, if at all.
A solution is to initialise the weights as N(0, 1/n_in), where n_in is the number of inputs to the layer. In this way z has variance of order 1, it is far less spread out, and the neurons are much less prone to saturate.
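
A quick numerical check of this argument (a standalone NumPy sketch; the 500 active / 500 inactive split and the sample sizes come from the example above, not from the question's model):

import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
x = np.zeros(n_in, dtype=np.float32)
x[:500] = 1.0                                   # 500 active inputs, 500 inactive

# N(0, 1) initialisation: z spreads out with std around sqrt(500) ~ 22
w_big = rng.normal(0.0, 1.0, size=(n_in, 10000))
print(np.std(x @ w_big))                        # roughly 22

# N(0, 1/n_in) initialisation: z stays of order 1
w_small = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, 10000))
print(np.std(x @ w_small))                      # roughly 0.7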
This scaling trick helps as a start, but in deep neural networks, with many hidden layers, the initialisation has to be handled at every layer; one option is to use batch normalization.
Besides this, from your code I can see you've chosen MSE as the cost function, i.e. a quadratic cost. I don't know whether your problem is a classification one, but if it is, I suggest you use a cross-entropy cost function instead, which speeds up learning.
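
For example, a hypothetical classification variant of the model above would be compiled with a cross-entropy loss like this (same Keras imports as the question's code; the layer sizes are kept from the question purely for illustration):

from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()
classifier.add(Dense(units = 4, kernel_initializer = 'he_normal', activation = 'relu', input_dim = 7))
classifier.add(Dense(units = 1, kernel_initializer = 'he_normal', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])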

In neural network backpropagation, how do you get the derivative equations?

I am confused: why is dz = da * g'(z)?
As we all know, in forward propagation a = g(z). Taking the derivative with respect to z gives da/dz = g'(z), so shouldn't it be dz = da * 1/g'(z)?
Thanks!!
From what I remember, in many courses representations like dZ are a shorter way of writing dJ/dZ, and so on. All the derivatives are of the cost with respect to the various parameters, activations, weighted sums, etc.
The derivative equations start from the last layer and are then built backwards; the expression for each layer depends on its activation function:
Linear: g'(z) = 1 (or a vector/matrix of ones matching the layer dimensions)
Sigmoid: g'(z) = g(z) * (1 - g(z))
Tanh: g'(z) = 1 - tanh^2(z)
ReLU: g'(z) = 1 if z > 0, else 0
Leaky ReLU: g'(z) = 1 if z > 0, and whatever leaky slope you chose otherwise
From there you basically have to compute the partial gradients for the previous layers, as in the short sketch below. Check out http://neuralnetworksanddeeplearning.com/chap2.html for a deeper understanding.
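
To make the notation concrete, here is a minimal NumPy sketch (the values are made up) of why dZ = dA * g'(Z) rather than dA / g'(Z): dZ stands for dJ/dZ, so the chain rule multiplies dJ/dA by dA/dZ.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward step for one layer: A = g(Z)
Z = np.array([[0.5, -1.2], [2.0, 0.1]])
A = sigmoid(Z)

# Suppose backprop has already given us dA = dJ/dA for this layer (illustrative values)
dA = np.array([[0.3, -0.1], [0.05, 0.2]])

# Chain rule: dJ/dZ = dJ/dA * dA/dZ = dA * g'(Z), with g'(Z) = g(Z) * (1 - g(Z)) for sigmoid
dZ = dA * (A * (1.0 - A))
print(dZ)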

Softmax layer and last layer of neural net

I have a doubt: suppose the last layer before the softmax layer has 1000 nodes and I have only 10 classes to classify. How does the softmax layer, which I would expect to output 1000 probabilities, output only 10 probabilities?
The output of the 1000-node layer will be the input to the 10-node layer. Basically,
x_10 = w^T * y_1000
The weight matrix w has to be of size 1000 x 10. The softmax function is then applied to x_10 to produce the probability output for the 10 classes.
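
A minimal NumPy sketch of the shapes involved (random values, biases omitted for brevity):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # shift for numerical stability
    return e / e.sum()

y_1000 = np.random.randn(1000)       # output of the 1000-node layer
w = np.random.randn(1000, 10)        # weights of the final 10-node layer
x_10 = w.T @ y_1000                  # 10 logits, one per class

p = softmax(x_10)
print(p.shape, p.sum())              # (10,) ~1.0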
You're wrong in your understanding! The network will output 10 probabilities for EACH example; the softmax is an ACTIVATION function! It takes the linear combination of the previous layer, depending on the incoming and outgoing weights, and, no matter what, outputs a number of probabilities equal to the number of classes! If you can add more details, like an example of what your neural network looks like, we can help you further and explain in a lot more depth so you understand what's going on!

Activity regularizer with softmax?

I have an L1 activity regularizer, activity_regularizer=l1, in the final layer of my generative neural network:
outputs = Dense(200, activation='softmax', activity_regularizer=l1(1e-5))(x)
It makes my results better, but I don't understand why it would change anything for a softmax activation. The sum of the outputs is always 1, with all values positive, so the regularizer should give exactly the same loss no matter what.
What is the activity_regularizer=l1(1e-5) doing in my training ?
Due to the Softmax, the contribution of the L1-Regularization to the total cost is in fact constant.
However, the gradient of the regularization term with respect to the activations is non-zero, and it sums to the number of non-zero activations (the gradient of abs is sign, so we get a sum of signs of activations, all positive because of the softmax).
You can try to run with and without the L1-term and check how many non-zero elements you end up with.
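
A minimal NumPy check of both points (illustrative values only, not the Keras internals): the L1 penalty on a softmax output is constant, while its gradient with respect to the activations is a vector of +lambda entries.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lam = 1e-5
for scale in (1.0, 10.0):                    # two very different sets of logits
    p = softmax(scale * np.random.randn(200))
    l1_term = lam * np.abs(p).sum()          # always lam * 1: the penalty is constant
    grad_wrt_p = lam * np.sign(p)            # every entry is +lam (softmax outputs > 0)
    # The gradient sums to lam * (number of non-zero activations),
    # even though the penalty itself never changes.
    print(l1_term, grad_wrt_p.sum())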
