Using softmax for multilabel classification (as per Facebook paper)

I came across this paper by some Facebook researchers where they found that using a softmax and CE loss function during training led to improved results over sigmoid + BCE. They do this by changing the one-hot label vector such that each '1' is divided by the number of labels for the given image (e.g. from [0, 1, 1, 0] to [0, 0.5, 0.5, 0]).
However, they do not mention how this could then be used in the inference stage, because the required threshold for selecting the correct labels is not clear.
Does anyone know how this would work?
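For reference, the label transformation described above can be sketched in NumPy as follows. This is only an illustration of the idea, not the paper's code: the function names `soft_targets` and `softmax_cross_entropy` are made up here, and the sketch assumes every image has at least one label.

```python
import numpy as np

def soft_targets(multi_hot):
    """Divide each '1' by the number of labels, e.g. [0, 1, 1, 0] -> [0, 0.5, 0.5, 0]."""
    multi_hot = np.asarray(multi_hot, dtype=float)
    # Assumption: at least one label per row, otherwise this divides by zero.
    return multi_hot / multi_hot.sum(axis=-1, keepdims=True)

def softmax_cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1)

targets = soft_targets([0, 1, 1, 0])
loss = softmax_cross_entropy([0.1, 2.0, 1.5, -1.0], targets)
```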

I've just stumbled upon this paper as well and asked myself the same question. Here's how I'd go about it.
If there is one ground truth tag, the ideal predicted vector would have a single 1 and all other predictions 0. If there were 2 tags, the ideal prediction would have two 0.5 and all others at 0. It makes sense to sort the predicted values by descending confidence and to look at the cumulative probability as we increase the number of candidates for the final number of tags.
We need to distinguish which option was the (sorted) ground truth:
1, 0, 0, 0, 0, ...
0.5, 0.5, 0, 0, 0, ...
1/3, 1/3, 1/3, 0, 0, ...
1/4, 1/4, 1/4, 1/4, 0, ...
1/5, 1/5, 1/5, 1/5, 1/5, 0, ...
The same tag position could have completely different ground truth values: 1.0 when alone, 0.5 when together with another one, 0.1 with 10 of them, and so on. A fixed threshold couldn't tell which was the correct case.
Instead, we can sort the predicted values in descending order and look at the corresponding cumulative sum. As soon as that sum exceeds a chosen value (let's say 0.95), the number of candidates included so far is the number of tags that we predict. Tweaking this cumulative-sum threshold is a way to trade precision against recall.
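A minimal NumPy sketch of this selection rule, assuming the model output is already a softmax distribution. The function name `select_tags` and the default of 0.95 are illustrative choices, not something taken from the paper.

```python
import numpy as np

def select_tags(probs, cum_threshold=0.95):
    """Return tag indices, most confident first, until the cumulative
    probability reaches cum_threshold."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")   # indices in descending confidence
    cumulative = np.cumsum(probs[order])
    # First position where the running sum reaches the threshold (inclusive).
    k = int(np.searchsorted(cumulative, cum_threshold)) + 1
    k = min(k, len(probs))                      # guard against rounding in the sum
    return order[:k].tolist()

print(select_tags([0.01, 0.50, 0.47, 0.02]))   # two dominant tags: [1, 2]
print(select_tags([0.97, 0.01, 0.01, 0.01]))   # one dominant tag: [0]
```

Lowering `cum_threshold` yields fewer tags (favouring precision); raising it yields more (favouring recall).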

Related

How exactly do you compute the gradients for the filters in a convolutional neural network?

I learned from several articles that to compute the gradients for the filters, you just do a convolution with the input volume as input and the error matrix as the kernel. After that, you subtract the gradients (multiplied by the learning rate) from the filter weights. I implemented this process but it's not working.
I even tried doing the backpropagation process myself with pen and paper, but the gradients I calculated don't make the filters perform any better. So am I understanding the whole process wrong?
Edit:
I will provide an example of my understanding of the backpropagation in CNNs and the problem with it.
Consider a randomised input matrix for a convolutional layer:
1, 0, 1
0, 0, 1
1, 0, 0
And a randomised weight matrix:
1, 0
0, 1
The output would be (after applying the ReLU activation):
1, 1
0, 0
The target for this layer is a 2x2 matrix filled with zeros. This way, we know the weight matrix should be filled with zeros also.
Error:
-1, -1
0, 0
By applying the process as stated above, the gradients are:
-1, -1
1, 0
So the new weight matrix is:
2, 1
-1, 1
This is not getting anywhere. If I repeat the process, the filter weights just go to extremely high values. So I must have made a mistake somewhere. So what is it that I'm doing wrong?

I'll give you a full example. It's not going to be short, but hopefully you will get it. I'm omitting both bias and activation functions for simplicity, but once you get it, it's simple enough to add those too. Remember, backpropagation is essentially the SAME in a CNN as in a simple MLP, but instead of multiplications you'll have convolutions. So, here's my sample:
Input:
.7 -.3 -.7 .5
.9 -.5 -.2 .9
-.1 .8 -.3 -.5
0 .2 -.1 .6
Kernel:
.1 -.3
-.5 .7
Doing the convolution yields (Result of 1st convolutional layer, and input for the 2nd convolutional layer):
.32 .27 -.59
.99 -.52 -.55
-.45 .64 .13
L2 Kernel:
-.5 .1
.3 .9
L2 activation:
.73 .29
.37 -.63
Here you would have a flatten layer and a standard MLP or SVM to do the actual classification. During backpropagation you'll receive a delta, which for fun let's assume is the following:
-.07 .15
-.09 .02
This will always be the same size as your activation before the flatten layer. Now, to calculate the kernel's delta for the current L2, you'll convolve L1's activation with the above delta. I'm not writing this down again, but the result will be:
.17 .02
-.05 .13
Updating the kernel is done as L2.Kernel -= LR * ROT180(dL2.K), meaning you first rotate the above 2x2 matrix and then update the kernel. This for our toy example turns out to be:
-.51 .11
.3 .88
Now, to calculate the delta for the first convolutional layer, recall that in an MLP you had the following: current_delta * current_weight_matrix. In a conv layer you have pretty much the same: you convolve the original kernel of the L2 layer (before the update) with the delta for the current layer. But this convolution has to be a full convolution. The result turns out to be:
.04 -.08 .02
.02 -.13 .14
-.03 -.08 .01
With this you'll go for the 1st convolutional layer, and will convolve the original input with this 3x3 delta:
.16 .03
-.09 .16
And update your L1 kernel the same way as above:
.08 -.29
-.5 .68
Then you can start over from feeding forward. The above calculations were rounded to 2 decimal places and a learning rate of .1 was used for calculating the new kernel values.
TLDR:
You get a delta
You calculate the next delta that will be used for the next layer as: FullConvolution(Li.W, delta), using the kernel before it is updated
Calculate the kernel delta that is used to update the kernel: Convolution(Li.Input, delta)
Go to next layer and repeat.
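The two steps can be sketched in NumPy as below. Note the assumptions: this sketch uses a plain "valid" cross-correlation for the forward pass (the kernel is not flipped, which is the convention most deep learning frameworks use), so it differs from the flipped-kernel convolution in the worked numbers above, needs no ROT180 on the kernel gradient, and will not reproduce those exact values. The function names are made up for illustration; bias and activation are omitted, as in the example.

```python
import numpy as np

def xcorr2d_valid(x, k):
    """'Valid' cross-correlation: slide k over x without flipping it."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv_backward(x, k, delta):
    """Gradients for y = xcorr2d_valid(x, k), given delta = dLoss/dy."""
    # Kernel gradient: slide the delta over the layer input.
    grad_k = xcorr2d_valid(x, delta)
    # Delta for the previous layer: full convolution of delta with the kernel,
    # i.e. zero-pad delta and cross-correlate with the 180-degree-rotated kernel.
    kh, kw = k.shape
    padded = np.pad(delta, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    grad_x = xcorr2d_valid(padded, np.rot90(k, 2))
    return grad_k, grad_x

# One gradient step on random data (learning rate chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))
k = rng.normal(size=(2, 2))
delta = rng.normal(size=(3, 3))          # same shape as the layer output
grad_k, grad_x = conv_backward(x, k, delta)
k_new = k - 0.1 * grad_k                 # grad_x is computed with the old k
```

A quick way to validate an implementation like this is to compare `grad_k` and `grad_x` against finite differences of the loss.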

How is the Laplacian filter calculated?

I don't really follow how they came up with the derivative equation. Could somebody please explain it in some detail, or point me to a link with a sufficient mathematical explanation?
Laplacian filter looks like

Monsieur Laplace came up with this equation. This is simply the definition of the Laplace operator: the sum of second order derivatives (you can also see it as the trace of the Hessian matrix).
The second equation you show is the finite difference approximation to a second derivative. It is the simplest approximation you can make for discrete (sampled) data. The derivative is defined as the slope (equation from Wikipedia):
f'(x) = lim_{h -> 0} ( f(x+h) - f(x) ) / h
In a discrete grid, the smallest h is 1. Thus the derivative is f(x+1)-f(x). This derivative, because it uses the pixel at x and the one to the right, introduces a half-pixel shift (i.e. you compute the slope in between these two pixels). To get to the 2nd order derivative, simply compute the derivative on the result of the derivative:
f'(x) = f(x+1) - f(x)
f'(x+1) = f(x+2) - f(x+1)
f"(x) = f'(x+1) - f'(x)
= f(x+2) - f(x+1) - f(x+1) + f(x)
= f(x+2) - 2*f(x+1) + f(x)
Because each derivative introduces a half-pixel shift, the 2nd order derivative ends up with a 1-pixel shift. So we can shift the output left by one pixel, leading to no bias. This leads to the sequence f(x+1)-2*f(x)+f(x-1).
Computing this 2nd order derivative is the same as convolving with a filter [1,-2,1].
Applying this filter, and also its transposed, and adding the results, is equivalent to convolving with the kernel
[ 0, 1, 0       [ 0, 0, 0       [ 0, 1, 0
  1,-4, 1   =     1,-2, 1   +     0,-2, 0
  0, 1, 0 ]       0, 0, 0 ]       0, 1, 0 ]
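As a quick check of the derivation, here is a small NumPy sketch (variable names are illustrative). It applies [1,-2,1] to samples of f(x) = x^2, whose second derivative is 2 everywhere, and then builds the 2D kernel as the sum of the horizontal filter and its transpose.

```python
import numpy as np

second_derivative = np.array([1, -2, 1])

# 1D check: f(x) = x^2 sampled at x = 0..5.
samples = np.arange(6) ** 2
print(np.convolve(samples, second_derivative, mode="valid"))  # prints [2 2 2 2]

# 2D Laplacian: the horizontal second derivative plus its transpose.
horizontal = np.zeros((3, 3), dtype=int)
horizontal[1, :] = second_derivative
laplacian = horizontal + horizontal.T
print(laplacian)
```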

Categorical accuracy

How does categorical accuracy work? By definition
categorical_accuracy checks to see if the index of the maximal true
value is equal to the index of the maximal predicted value.
and
Calculates the mean accuracy rate across all predictions for
multiclass classification problems
What does this mean in practice? Let's say I am predicting the bounding box of an object.
It has (xmin, ymin, xmax, ymax). Does it check whether the predicted xmin is equal to the real xmin? So if xmin and xmax were the same in the predicted and real values, and ymin and ymax were different, would I get 50%?
Please help me understand this concept.

Traditionally for multiclass classification, your labels will have some integer (or equivalently categorical) label; for example:
labels = [0, 1, 2]
The output of a multiclass classification prediction will typically be a probability distribution of confidences; for example:
preds = [0.25, 0.5, 0.25]
Normally the index associated with the most likely event will be the index of the label. In this case, the argmax(preds) is 1, which maps to label 1.
You can see the total accuracy of your predictions a la confusion matrices, where one axis is the "true" value, and the other axis is the "predicted" value. Each cell CM[y_true][y_pred] holds the number of instances with that true label and that predicted label. The accuracy is the sum of the main diagonal of the matrix (y_true = y_pred) divided by the total number of instances.
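A minimal NumPy sketch of both views, assuming one-hot true labels and one probability vector per prediction. The helper names are illustrative and this is not the library's own implementation.

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose argmax prediction matches the argmax label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)))

def confusion_matrix(true_idx, pred_idx, num_classes):
    """cm[t, p] counts the samples with true class t predicted as class p."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_idx, pred_idx):
        cm[t, p] += 1
    return cm

y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.25, 0.5, 0.25], [0.3, 0.4, 0.3], [0.1, 0.8, 0.1]]

print(categorical_accuracy(y_true, y_pred))  # 3 of 4 correct: 0.75

cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), 3)
print(np.trace(cm) / cm.sum())               # same value from the diagonal
```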

What is "linear projection" in a convolutional neural network?

I am reading through Residual learning, and I have a question.
What is the "linear projection" mentioned in 3.2? It probably looks pretty simple once you get it, but I could not grasp the idea.
Can someone provide a simple example?

First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.
x is the input data (called a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv+relu+batchnorm in this paper), and y combines the two together (forming the output channel). The result of F is also of rank 4, and most of its dimensions will be the same as in x, except for one. That's exactly what the transformation should patch.
For example, the shape of x might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. F(x) might be (64, 32, 32, 16): the batch size never changes and, for simplicity, the ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters: 16.
So, in order for y=F(x)+x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).
I'd like to stress here that "reshaping" here is not what numpy.reshape does.
Instead, the last dimension of x (index 3, the channels) is padded with 13 zeros, like this:
pad(x=[1, 2, 3],padding=[7, 6]) = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we start to think that our vector is the same, but there are 13 more dimensions out there. None of the other x dimensions are changed.
Here's the link to the code in Tensorflow that does this.
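The zero-padding described in this answer can be sketched with NumPy (not the TensorFlow code the answer refers to). The helper name `pad_channels` is made up, and the 7/6 split of the 13 zeros simply mirrors the example above.

```python
import numpy as np

def pad_channels(x, out_channels):
    """Zero-pad the last (channel) axis of x up to out_channels."""
    extra = out_channels - x.shape[-1]
    before, after = (extra + 1) // 2, extra // 2   # 13 extra -> 7 before, 6 after
    # Only the channel axis is padded; batch and spatial axes are left alone.
    return np.pad(x, ((0, 0), (0, 0), (0, 0), (before, after)))

x = np.ones((64, 32, 32, 3))
fx = np.zeros((64, 32, 32, 16))        # stand-in for the output of F(x)
y = fx + pad_channels(x, fx.shape[-1]) # shapes now match for the addition
print(y.shape)                         # (64, 32, 32, 16)
```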

A linear projection is one where each new feature is simply a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. If x is the vector of N input features and W is an M-by-N matrix, then the matrix product Wx yields M new features where each one is a linear projection of x. Each row of W is a set of weights that defines one of the M linear projections (i.e., each row of W contains the coefficients for one of the weighted sums of x).
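In NumPy terms, a small sketch with arbitrary sizes (N = 3, M = 5) and random values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 5
x = rng.normal(size=N)        # N input features
W = rng.normal(size=(M, N))   # one row of weights per new feature

projected = W @ x             # M new features
# Each new feature is the weighted sum defined by one row of W.
first_feature = np.dot(W[0], x)
```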

In PyTorch (in particular torchvision\models\resnet.py), at the end of a Bottleneck you will have one of two scenarios:
The number of channels of the input x, say x_c (channels, not spatial resolution), is smaller than the number of channels of the output of layer conv3 of the Bottleneck, say d. This is resolved by a 1 by 1 convolution with in_planes = x_c and out_planes = d, with stride 1, followed by batch normalization; the addition F(x) + x then takes place, assuming x and F(x) have the same spatial resolution.
Both the spatial resolution of x and its number of channels don't match the output of the Bottleneck layer, in which case the 1 by 1 convolution mentioned above needs to have stride 2 so that both the spatial resolution and the number of channels match for the element-wise addition (again with batch normalization applied to the projected x before the addition).
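To make the shapes concrete, here is a NumPy sketch of what a bias-free 1 by 1 convolution with a stride does. It is not the torchvision code: the function name `projection_shortcut` is made up, batch normalization is omitted, and the layout is assumed to be (batch, channels, height, width).

```python
import numpy as np

def projection_shortcut(x, weight, stride=1):
    """1x1 convolution: x is (N, C_in, H, W), weight is (C_out, C_in)."""
    x = x[:, :, ::stride, ::stride]                 # stride subsamples the pixels
    # At every remaining pixel, mix the C_in channels into C_out channels.
    return np.einsum("oc,nchw->nohw", weight, x)

x = np.random.default_rng(0).normal(size=(2, 64, 8, 8))
weight = np.random.default_rng(1).normal(size=(256, 64))

print(projection_shortcut(x, weight, stride=1).shape)  # (2, 256, 8, 8)
print(projection_shortcut(x, weight, stride=2).shape)  # (2, 256, 4, 4)
```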

Reducing the number of output neurons

I am trying to train a neural network to control a character's speed in 2 dimensions: x and y, each between -1 and 1 m/sec. Currently I split the range into 0.1 m/sec intervals, so I end up with 400 output neurons (20 x values * 20 y values). If I increase the accuracy to 0.01, I end up with 40k output neurons. Is there a way to reduce the number of output neurons?

I assume you are treating the problem as a classification problem. At training time, you have input X and output Y. Since you are training the neural network for classification, your expected output always looks like:
      -1  -0.9  ...  0.3  0.4  0.5  ...  1.0    m/s
Y1 = [ 0,  0,   ...,  1,   0,   0,  ...,  0 ]  // speed x component
Y2 = [ 0,  0,   ...,  0,   0,   1,  ...,  0 ]  // speed y component
Y  = [Y1, Y2]
That is: for each of the two speed components (x and y direction), exactly one neuron outputs 1 and all other neurons output 0 (in the example above, the expected output is 0.3 m/s in the x direction and 0.5 m/s in the y direction for this training instance). Actually this is probably easier to learn and has better prediction performance. But as you pointed out, it does not scale.
I think you can also treat the problem as a regression problem. In your network, you have one neuron for each speed component. Your expected output is just:
Y = [0.3, 0.5] // for the same training instance you have.
To get an output range of -1 to 1, you have different options for the activation function in the output layer. For example, you can use
f(x) = 2 * (Sigmoid(x) - 0.5)
Sigmoid(x) = 1 / (1 + exp(-x))
Since Sigmoid(x) is in (0,1), 2*(Sigmoid(x) - 0.5) is in (-1,1). This change (replacing the many neurons in the output layer with two neurons) greatly decreases the complexity of the model, so you might want to add more neurons in the middle layer to avoid underfitting.
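A small NumPy sketch of this output activation (the name `scaled_sigmoid` is illustrative). As a side note, 2*(Sigmoid(x) - 0.5) is mathematically the same function as tanh(x/2), so a tanh output layer achieves the same range.

```python
import numpy as np

def scaled_sigmoid(x):
    """Map any real pre-activation into the open interval (-1, 1)."""
    return 2.0 * (1.0 / (1.0 + np.exp(-x)) - 0.5)

pre_activations = np.array([-3.0, 0.0, 0.62, 1.1])  # raw outputs of the two-neuron layer, say
speeds = scaled_sigmoid(pre_activations)             # values in (-1, 1) m/s
```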
