Why does this model have a softmax layer? - machine-learning

This picture is from this paper: http://arxiv.org/pdf/1511.02300v2.pdf. I cannot understand the functionality of the softmax in this model. If our goal is to find the bounding boxes for object detection, why do we use a softmax at the end?

Softmax is applied to the class-based output (look at the graph: it is not the bounding-box output!). The bounding-box output does not use a softmax but rather a plain linear output + L1 loss.
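For illustration only (this is not the paper's code, and the module and variable names here are made up): a detection head typically splits into a classification branch trained with softmax/cross-entropy and a box-regression branch trained with an L1-type loss, roughly like this:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Illustrative two-branch head: class scores go through a softmax
    (inside cross-entropy), box coordinates stay a plain linear output."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.cls_branch = nn.Linear(feat_dim, num_classes)  # logits -> softmax
        self.box_branch = nn.Linear(feat_dim, 4)            # raw box coordinates

    def forward(self, feats):
        return self.cls_branch(feats), self.box_branch(feats)

head = DetectionHead(feat_dim=256, num_classes=20)
feats = torch.randn(8, 256)
cls_logits, boxes = head(feats)
cls_loss = nn.functional.cross_entropy(cls_logits, torch.randint(0, 20, (8,)))
box_loss = nn.functional.l1_loss(boxes, torch.randn(8, 4))  # no softmax here
```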


Why can't you use a 3D volume input for LSTM?

In the paper CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning, which I recently read, it is specified that the 3D volume output from a CNN layer must be reduced to a two-dimensional sequence before entering the LSTM layer. Why is that? What's wrong with using the three-dimensional format?
The standard LSTM neural network assumes input of the following size:
[batch size] × [sequence length] × [feature dim]
The LSTM first multiplies each vector of size [feature dim] by a matrix, and then combines the results in a fancy way. What's important here is that there is one vector per example (the batch dimension) and per timestep (the sequence-length dimension). In a sense, this vector is first transformed by matrix multiplications (possibly involving some pointwise non-linearities, which don't change the shape, so I don't mention them) into a hidden-state update, which is also a vector, and the updated hidden-state vector is then used to produce the output (also a vector).
As you can see, the LSTM is designed to operate on vectors. You could design a Matrix-LSTM – an LSTM counterpart that assumes any or all of the following are matrices: the input, the hidden state, the output. That would require you to replace the matrix-vector multiplications that process the input (or the state) with a generalized linear operation able to turn any matrix into any other, which would be given by a rank-4 tensor, I believe. However, it would be equivalent to just reshaping the input matrix into a vector, reshaping the rank-4 tensor into a matrix, doing a matrix-vector product, and then reshaping the output back into a matrix, so it makes little sense to devise such Matrix-LSTMs instead of just reshaping your inputs.
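Concretely, that reshaping is a one-liner. A hedged sketch (the dimension names are assumptions for illustration, not CountNet's actual code) of flattening a CNN's 3D volume into the 2D-per-timestep sequence an LSTM expects:

```python
import torch
import torch.nn as nn

batch, channels, freq, time = 4, 16, 32, 100
cnn_out = torch.randn(batch, channels, freq, time)  # 3D volume per example

# Treat the time axis as the sequence and flatten the rest into one feature vector
seq = cnn_out.permute(0, 3, 1, 2).reshape(batch, time, channels * freq)

lstm = nn.LSTM(input_size=channels * freq, hidden_size=64, batch_first=True)
out, _ = lstm(seq)  # out: [batch, time, 64]
```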
That said, it might still make sense to design a generalized LSTM that takes something other than a vector as input if you know something about the input structure that suggests a more specific linear operator than a general rank-4 tensor. For example, images are known to have local structure (nearby pixels are more related than those far apart), hence using convolutions is more "reasonable" than reshaping images to vectors and then performing a general matrix multiplication. In a similar fashion, you could replace all the matrix-vector multiplications in the LSTM with convolutions, which would allow for image-like inputs, states, and outputs.
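For example, here is a minimal ConvLSTM cell sketch in PyTorch (a common formulation written for illustration, not taken from any particular paper), where every matrix-vector product of the standard LSTM is replaced by a convolution:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """All four gates are computed by one convolution over [input, hidden]."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell states are feature maps, not vectors
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```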

What is loss_cls and loss_bbox and why are they always zero in training

I'm trying to train faster_rcnn on a custom dataset using the PyTorch implementation of Detectron here. I have made changes to the dataset and configuration according to the guidelines in the repo.
The training process is carried out successfully, but the loss_cls and loss_bbox values are 0 from the beginning, and even though the training completes, the final output cannot be used for evaluation or inference.
I would like to know what these two mean and how to get those values to change during training. The exact model I'm using here is e2e_faster_rcnn_R-50-FPN_1x.
Any help regarding this would be appreciated. I'm using Ubuntu 16.04 with Python 3.6 on Anaconda, CUDA 9, cuDNN 7.
What are the two losses?
When training a multi-object detector, you usually have (at least) two types of losses:
loss_bbox: a loss that measures how "tight" the predicted bounding boxes are around the ground-truth object (usually a regression loss: L1, smooth L1, etc.).
loss_cls: a loss that measures the correctness of the classification of each predicted bounding box: each box may contain an object class or "background". This loss is usually a cross-entropy loss.
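In PyTorch terms, the two losses might look like this hedged sketch (the tensor shapes and class count are assumptions for illustration, not Detectron's internals):

```python
import torch
import torch.nn.functional as F

num_boxes, num_classes = 100, 81  # e.g. COCO's 80 classes + background
cls_logits = torch.randn(num_boxes, num_classes)
cls_targets = torch.randint(0, num_classes, (num_boxes,))
loss_cls = F.cross_entropy(cls_logits, cls_targets)      # classification loss

box_preds = torch.randn(num_boxes, 4)
box_targets = torch.randn(num_boxes, 4)
loss_bbox = F.smooth_l1_loss(box_preds, box_targets)     # box-regression loss
```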
Why are the losses always zero?
When training a detector, the model predicts quite a few (~1K) possible boxes per image. Most of them are empty (i.e., belong to the "background" class). The loss function associates each of the predicted boxes with the ground-truth box annotations of the image.
If a predicted box has significant overlap with a ground-truth box, then loss_bbox and loss_cls are computed to see how well the model is able to predict the ground-truth box.
On the other hand, if a predicted box has no overlap with any ground-truth box, then only loss_cls is computed, for the "background" class.
However, if a predicted box has only very partial overlap with the ground truth, it is "discarded" and no loss is computed. I suspect, for some reason, that this is the case for your training session.
I suggest you check the parameters that determine the association between predicted boxes and ground-truth annotations. Moreover, look at the parameters of your "anchors": these parameters determine the scale and aspect ratios of the predicted boxes.
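As a hedged sketch of that association step (the 0.7/0.3 thresholds are typical Faster R-CNN RPN defaults, not necessarily your config's values): boxes above a high IoU become positives, boxes below a low IoU become background, and everything in between is ignored and contributes no loss:

```python
import torch
from torchvision.ops import box_iou

def assign_targets(pred_boxes, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return per-prediction labels: 1 = object, 0 = background, -1 = ignored.
    Boxes are (x1, y1, x2, y2) tensors."""
    iou = box_iou(pred_boxes, gt_boxes)      # [num_pred, num_gt] pairwise IoU
    best_iou, _ = iou.max(dim=1)             # best overlap per prediction
    labels = torch.full((pred_boxes.size(0),), -1, dtype=torch.long)
    labels[best_iou >= pos_thresh] = 1       # gets loss_cls and loss_bbox
    labels[best_iou < neg_thresh] = 0        # gets loss_cls only ("background")
    return labels                            # -1 entries produce no loss at all
```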

Understanding Faster R-CNN

I'm trying to understand Fast(er) R-CNN, and the following are the questions I'm searching for:
1. To train a Fast R-CNN model, do we have to give bounding-box information in the training phase?
2. If we have to give bounding-box information, then what's the role of the ROI layer?
3. Can we take a pre-trained model that is only trained for classification, not object detection, and use it for Fast(er) R-CNN?
Your answers:
1.- Yes.
2.- The ROI layer is used to produce a fixed-size feature vector from variable-sized regions. This is performed using max-pooling, but instead of using the typical n-by-n cells, the region is divided into n-by-n non-overlapping sub-regions (which vary in size) and the maximum value in each sub-region is output. The ROI layer also does the job of projecting the bounding box from input space to feature space (see the sketch after this list).
3.- Faster R-CNN MUST be used with a pretrained network (typically on ImageNet); it is not trained from scratch. This might be a bit hidden in the paper, but the authors do mention that they use features from a pretrained network (VGG, ResNet, Inception, etc.).
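To make point 2 concrete, torchvision ships an ROI-pooling op; here is a minimal hedged sketch (the feature-map size, stride, and box coordinates are arbitrary choices for illustration):

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)   # feature map from the backbone
# One region of interest: (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0.0, 10.0, 10.0, 200.0, 120.0]])

# spatial_scale projects the box from input space to feature space (e.g. 1/16 stride)
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]) -- fixed size regardless of ROI size
```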

Conclusion from PCA of dataset

I have a set of data for sequence labeling.
I did PCA (with 2 principal components on the x and y axes) on the dataset, and it turns out as below:
Using an LSTM network to classify the dataset above, I then decided to extract the activations from the hidden layer of the LSTM. What I obtain looks like the figure below:
My question is, what conclusions can I draw by comparing the two results?
Is it fair to say that the features of the original dataset are now self-organized after running it through an LSTM classifier?
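For reference, here is a hedged sketch of the comparison being described (sklearn PCA on raw features vs. on LSTM hidden activations; all shapes and names are assumptions, not the asker's actual setup):

```python
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

X = torch.randn(500, 20, 8)                   # 500 sequences, 20 steps, 8 features
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)

with torch.no_grad():
    _, (h_n, _) = lstm(X)                     # final hidden state: [1, 500, 32]
hidden = h_n.squeeze(0).numpy()

pca_raw = PCA(n_components=2).fit_transform(X.reshape(500, -1).numpy())
pca_hidden = PCA(n_components=2).fit_transform(hidden)
# Plot pca_raw and pca_hidden side by side to compare the two projections
```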

Gradient descent on the inputs of a pre-trained neural network to achieve a target y-value

I have a trained neural network which suitably maps my inputs to my outputs. Is it then possible to specify a desired y output and then use a gradient-descent method to determine the optimum input values to get that output?
When using backpropagation, the partial derivative of the error with respect to each weight is used to proportionally adjust the weights; is there a way to do something similar with the input values themselves and a target y value?
A neural network is basically a complex mathematical function. By adjusting the weights you basically adjust that function's parameters. Given that, your question is whether you can easily and automatically invert the function. I don't think this can be done easily.
I think the only thing you can do is to create another, inverted network and train it with inverted data.
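That said, the input-side gradient descent the question describes can at least be prototyped by freezing the weights and treating the input itself as the optimized parameter; a minimal hedged sketch (the network here is a stand-in, not a trained model):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))  # stand-in for a trained net
for p in net.parameters():
    p.requires_grad_(False)                   # freeze the weights

target_y = torch.tensor([[0.5]])
x = torch.randn(1, 3, requires_grad=True)     # the input is now the "parameter"
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    loss = (net(x) - target_y).pow(2).mean()  # drive the output toward target_y
    loss.backward()                           # gradient flows into x, not the weights
    opt.step()
```

Note that this finds one of possibly many inputs producing the target output, and may land in a local minimum, which is consistent with the point above that the function has no easy unique inverse.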
