I am using the sigmoid cross entropy loss function for a multilabel classification problem as laid out by this tutorial. However, in both their results on the tutorial and my results, the output predictions are in the range (-Inf, Inf), while the range of a sigmoid is [0, 1]. Is the sigmoid only processed in the backprop? That is, shouldn't a forward pass squash the output?
In this example the input to the "SigmoidCrossEntropyLoss" layer is the output of a fully connected layer. Indeed, there are no constraints on the output values of an "InnerProduct" layer, and they can fall anywhere in the range (-inf, inf).
However, if you examine carefully the "SigmoidCrossEntropyLoss" you'll notice that it includes a "Sigmoid" layer inside -- to ensure stable gradient estimation.
Therefore, at test time, you should replace the "SigmoidCrossEntropyLoss" with a simple "Sigmoid" layer to output per-class predictions.
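To make this concrete, here is a minimal NumPy sketch (not Caffe code) of the idea: the loss is computed directly from the raw scores, while a separate sigmoid is what you would apply at test time to get per-class probabilities. The logits, labels, and helper names are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, evaluated without overflowing for large |x|."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def sigmoid_cross_entropy(logits, labels):
    """Loss from raw scores: max(x, 0) - x*t + log(1 + exp(-|x|)), summed."""
    x = np.asarray(logits, dtype=float)
    t = np.asarray(labels, dtype=float)
    return np.sum(np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x))))

# Hypothetical raw outputs of the last fully connected layer (unbounded).
logits = np.array([-3.2, 0.0, 4.1])
labels = np.array([0.0, 1.0, 1.0])

train_loss = sigmoid_cross_entropy(logits, labels)  # what training sees
probs = sigmoid(logits)                             # what test time should output
predictions = probs > 0.5                           # per-class decisions
```

The raw scores stay unbounded; only `probs` is squashed into (0, 1), which is why the predictions look unbounded if you read them from the layer feeding the loss.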
I am having a hard time finding resources online about how to perform backpropagation with the bias in a convolutional neural network. By bias I mean the number added to every value resulting from a convolution.
Here is a picture further explaining
I know how to calculate the gradient for the filter's weights but I am not sure what to do about the biases. Right now I am just adjusting it by the average error for that layer. Is this correct?
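It is similar to the bias gradient in standard neural networks, but here we sum the gradients of the loss with respect to every element of the convolution output:
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{w} \sum_{j=1}^{h} \frac{\partial L}{\partial x_{i,j}}$$
where L is the loss function, w and h are the width and height of the conv output, and $\partial L / \partial x_{i,j}$ is the gradient of the loss with respect to the conv output at position (i, j) (the symbol x for the conv output is chosen here for illustration).
Thus, the gradient of b is computed by summing the gradients of the loss L with respect to the convolution output over every position (i, j).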
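As a sanity check on this rule, here is a small NumPy sketch for a single-channel convolution with one scalar bias. The squared-error loss, the random data, and the helper `conv2d_valid` are assumptions made for the example; the analytic bias gradient is compared against a finite-difference estimate.

```python
import numpy as np

def conv2d_valid(x, k, b):
    """Single-channel 'valid' cross-correlation plus one scalar bias."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
k = rng.normal(size=(3, 3))
b = 0.1
target = rng.normal(size=(4, 4))

def loss(b_value):
    # Hypothetical loss: half the squared error against an arbitrary target.
    return 0.5 * np.sum((conv2d_valid(x, k, b_value) - target) ** 2)

# For this loss, dL/d(out) = out - target; the bias gradient is its sum.
grad_out = conv2d_valid(x, k, b) - target
grad_b = grad_out.sum()

# Central finite difference as an independent check.
eps = 1e-6
grad_b_numeric = (loss(b + eps) - loss(b - eps)) / (2 * eps)
```

Note that the result is a sum, not a mean: averaging the output gradients instead would scale the bias update down by a factor of w*h.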
Hope this helps.
I am trying to build a recurrent convolutional autoencoder in Tensorflow, but I am having trouble linking the convolutional autoencoder with the recurrent layer.
From my understanding, a Tensorflow RNNCell takes an input of shape (batch_size, time_steps, info_vector), but my 1D convolutional layer has an output shape of (batch_size, info_vector). Is there a way to have Tensorflow store the previous information vectors? Alternatively, do I need to use a 2D convolution, add an extra time_step dimension to the input, and then not convolve over that dimension?
Try to expand the dimensionality of the tensor:
cnn_out = last_output_of_cnn  # for example shape [32, 10]
cnn_out = tf.expand_dims(cnn_out, axis=-1)  # new shape [32, 10, 1]
You can feed this to the first layer of your RNN; here the number of time steps is 10.
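Which axis you expand decides how the RNN reads the data. The sketch below uses NumPy's `np.expand_dims` as a stand-in for `tf.expand_dims` (same `axis` semantics) purely to show the resulting shapes; the zero array is a placeholder for the real CNN output.

```python
import numpy as np

cnn_out = np.zeros((32, 10))               # placeholder: (batch_size, info_vector)

# Trailing axis: 10 time steps, each carrying a single feature.
steps_of_one = np.expand_dims(cnn_out, axis=-1)   # shape (32, 10, 1)

# Middle axis: a single time step carrying the whole 10-element vector.
one_step = np.expand_dims(cnn_out, axis=1)        # shape (32, 1, 10)
```

If each CNN output is meant to be one step of a longer sequence, the second layout, stacked over several inputs along axis 1, may be closer to what the question describes; that depends on the model and is not stated in the thread.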
I am trying to make sure I'm using the correct terminology. The below diagram shows the MNIST example
X is a 784-element row vector
W is a 784x10 matrix
b is a 10-element row vector
The output of the linear box is fed into softmax
The output of softmax is fed into the distance function cross-entropy
How many layers are in this NN? What are the input and hidden layer in that example?
Similarly, how many layers are in this answer? If my understanding is correct, then 3 layers?
Edit
@lejlot Does the below represent a 3-layer NN with 1 hidden layer?
Take a look at this picture:
http://cs231n.github.io/assets/nn1/neural_net.jpeg
In your first picture you have only two layers:
Input layer -> 784 neurons
Output layer -> 10 neurons
Your model is very simple (w directly contains the connections between the input and the output, and b contains the bias terms).
With no hidden layer you are obtaining a linear classifier, because a linear combination of linear combinations is a linear combination again. The hidden layers are what include non linear transformations in your model.
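The claim that stacked linear maps collapse into one can be checked numerically. In this sketch the layer sizes and random weights are arbitrary choices for illustration; two linear layers with no activation in between are compared against the single equivalent layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 784))                       # 5 example inputs
W1, b1 = rng.normal(size=(784, 32)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 10)), rng.normal(size=10)

# Two linear layers applied one after the other, no nonlinearity between them.
two_layers = (x @ W1 + b1) @ W2 + b2

# The single linear layer they are equivalent to.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b
```

Both computations give the same outputs up to floating-point rounding, so the extra layer adds parameters but no extra expressive power.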
In your second picture you have 3 layers, but you have confused the notation:
The input layer is the vector x where you place the input data.
Then the operation -> w -> +b -> f() -> is the connection between the first layer and the second layer.
The second layer is the vector z where you store the result z=f(xw1+b1)
Then softmax(zw2+b2) is the connection between the second and the third layer.
The third layer is the vector y where you store the final result y=softmax(zw2+b2).
Cross entropy is not a layer; it is the cost function used to train your neural network.
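The breakdown above can be sketched as a forward pass in NumPy. The hidden size of 16, the choice of tanh for f(), the random weights, and the label are all assumptions made for the example.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)    # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 784))                # layer 1: the input vector x
w1, b1 = 0.01 * rng.normal(size=(784, 16)), np.zeros(16)
w2, b2 = 0.01 * rng.normal(size=(16, 10)), np.zeros(10)

z = np.tanh(x @ w1 + b1)                     # layer 2 (hidden): z = f(x w1 + b1)
y = softmax(z @ w2 + b2)                     # layer 3 (output): y = softmax(z w2 + b2)

label = 3                                    # hypothetical true class
cost = -np.log(y[0, label])                  # cross entropy: a cost, not a layer
```

The vectors x, z and y are the three layers; the weights, biases and activations are the connections between them, and the cost only exists during training.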
EDIT:
One more thing: if you want to obtain a nonlinear classifier, you must apply a nonlinear transformation in every hidden layer. In the example I described, that means f() must be a nonlinear function (for example sigmoid, softsign, ...):
z=f(xw1+b1)
If you add a nonlinear transformation only in the output layer (the softmax function that you have at the end), your model is still a linear classifier.
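One way to see this: softmax preserves the ordering of its inputs, so the predicted class is decided entirely by the linear scores. The sketch below, with arbitrary random weights and inputs, checks that the argmax is the same before and after softmax.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)    # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 784))
W, b = 0.01 * rng.normal(size=(784, 10)), rng.normal(size=10)

scores = x @ W + b                           # purely linear in x
probs = softmax(scores)

# The chosen class is the same with or without the softmax.
same_decision = np.array_equal(scores.argmax(axis=1), probs.argmax(axis=1))
```

Since the class boundaries are where two linear scores are equal, they are hyperplanes, regardless of the softmax on top.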
That has 1 hidden layer.
The answer you link to, I would call a 2-hidden layer NN.
Your input-layer is the X-vector.
Your layer Wx+b is the hidden layer, i.e. the box in your picture.
The output-layer is the Soft-max.
The cross-entropy is your loss/cost function, and is not a layer at all.
When convolving a multi-channel image into a one-channel image, you usually have only one bias variable (as the output has one channel). If I want to set local biases, that is, a separate bias for each pixel of the output image, how can I do this in caffe and torch?
In Tensorflow, this is very simple. You just set a bias matrix, for example:
data is 25(height)X25(width)X48(channels), plus a leading batch dimension
weights is 3X3(kernel size)X48(input channels)X1(output channels)
biases is 25X25X1 (the trailing 1 lines up with the single output channel when broadcasting),
then,
hidden = tf.nn.conv2d(data, weights, [1, 1, 1, 1], padding='SAME')
output = tf.nn.relu(hidden + biases)
Is there a similar solution in caffe or torch?
For caffe, here is a post about the Scale layer: Scale layer in Caffe. The Scale layer can only provide a single bias variable.
The answer is the Bias layer. The Bias layer can hold a weight matrix, which you can treat as the per-pixel biases.
For torch, there is an nn.Add() layer, much like Tensorflow's tf.add() function, so the nn.Add() layer is the solution.
All of these have been verified with actual models.
But still, thank you very much @Shai
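Whichever framework is used, a per-pixel bias comes down to adding a height-by-width map that is shared across the batch. A quick NumPy shape check (NumPy follows the same right-aligned broadcasting rule as the Tensorflow addition in the question) shows why the bias map needs an explicit channel axis; the batch size of 2 and the zero-filled arrays are placeholders.

```python
import numpy as np

hidden = np.zeros((2, 25, 25, 1))      # conv output: (batch, height, width, channels)

# Bias map with a trailing channel axis: one bias per output pixel.
biases = np.arange(25 * 25, dtype=float).reshape(25, 25, 1)
output = np.maximum(hidden + biases, 0.0)          # ReLU(hidden + biases)

# A plain (25, 25) map is aligned with the last two axes (width, channels)
# instead, so the sum silently grows to 25 channels.
wrong = hidden + biases[:, :, 0]
```

Here `output[n, i, j, 0]` equals `biases[i, j, 0]` for every batch item `n`, which is the intended "one bias per pixel" behaviour.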
If I have a feed-forward multilayer perceptron with sigmoid activation function, which is trained and has known weights, how can I find the equation of the curve that is approximated by the network (the curve that separates between 2 types of data)?
In general, there is no closed form solution for the input points where your NN output is 0.5 (or 0, in case of -1/1 instead of 0/1).
What is usually done for visualization in a low-dimensional input space is gridding up the input space and computing the contours of the NN output. (The contours are a smooth estimate of what the NN response surface looks like.)
In MATLAB, one would do
[X,Y] = meshgrid(linspace(-1,1), linspace(-1,1));
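contour(X, Y, f(X,Y))  % pass X and Y so the axes show input coordinates
% contour(X, Y, f(X,Y), [0.5 0.5]) draws only the 0.5 decision boundary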
where f is your trained NN, and assuming [-1,1] x [-1,1] space.
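The same gridding idea can be sketched in Python with NumPy. The 2-3-1 sigmoid network and its weights below are invented stand-ins for a trained model; the code evaluates it on a grid and collects the grid points where the output crosses 0.5, which approximate the separating curve.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical "trained" weights for a 2-3-1 sigmoid MLP.
W1 = np.array([[4.0, -3.0, 1.0],
               [2.0, 5.0, -4.0]])
b1 = np.array([0.5, -1.0, 0.2])
W2 = np.array([2.0, -3.0, 1.5])
b2 = 0.1

def f(X, Y):
    """Evaluate the network at every grid point; returns an array shaped like X."""
    pts = np.stack([X.ravel(), Y.ravel()], axis=1)   # (n_points, 2)
    hidden = sigmoid(pts @ W1 + b1)
    return sigmoid(hidden @ W2 + b2).reshape(X.shape)

X, Y = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
Z = f(X, Y)

# Mark grid points whose right or lower neighbour falls on the other side of 0.5.
above = Z > 0.5
crossing = np.zeros_like(above)
crossing[:, :-1] |= above[:, :-1] != above[:, 1:]
crossing[:-1, :] |= above[:-1, :] != above[1:, :]
boundary_points = np.column_stack([X[crossing], Y[crossing]])
```

For plotting, matplotlib's `plt.contour(X, Y, Z, levels=[0.5])` would draw the same boundary directly; a finer grid gives a closer approximation, but it remains a numerical estimate rather than a closed-form equation.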