In the RNN world, does it matter which end of the word vector is padded so that they all have the same length?
Example
pad_left = [0, 0, 0, 0, 5, 4, 3, 2]
pad_right = [5, 4, 3, 2, 0, 0, 0, 0]
The authors of the paper Effects of Padding on LSTMs and CNNs ran experiments on pre-padding vs. post-padding. Here is their conclusion.
For LSTMs, the accuracy of post-padding (50.117%) is way less than pre-padding (80.321%).
Pre-padding and post-padding don't matter much for CNNs because, unlike LSTMs, CNNs don't try to remember information from previous outputs, but instead try to find patterns in the given data.
I never expected the padding position to have such a big effect, so I suggest you verify it yourself.
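For reference, a minimal sketch of how the two modes are usually produced, assuming Keras/TensorFlow; the padding argument of pad_sequences picks which end receives the zeros:

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[5, 4, 3, 2], [7, 6]]

# pre-padding (zeros in front) - the better option for LSTMs per the paper
print(pad_sequences(sequences, maxlen=8, padding='pre'))
# [[0 0 0 0 5 4 3 2]
#  [0 0 0 0 0 0 7 6]]

# post-padding (zeros at the end)
print(pad_sequences(sequences, maxlen=8, padding='post'))
# [[5 4 3 2 0 0 0 0]
#  [7 6 0 0 0 0 0 0]]

If post-padding is unavoidable, masking (e.g. mask_zero=True on an Embedding layer, or a Masking layer) lets the LSTM skip the padded timesteps.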
I came across this paper by some Facebook researchers where they found that using a softmax and CE loss function during training led to improved results over sigmoid + BCE. They do this by changing the one-hot label vector such that each '1' is divided by the number of labels for the given image (e.g. from [0, 1, 1, 0] to [0, 0.5, 0.5, 0]).
However, they do not mention how this could then be used in the inference stage, because the required threshold for selecting the correct labels is not clear.
Does anyone know how this would work?
I've just stumbled upon this paper as well and asked myself the same question. Here's how I'd go about it.
If there is one ground truth tag, the ideal predicted vector would have a single 1 and all other predictions 0. If there were 2 tags, the ideal prediction would have two 0.5 and all others at 0. It makes sense to sort the predicted values by descending confidence and to look at the cumulative probability as we increase the number of candidates for the final number of tags.
We need to distinguish which option was the (sorted) ground truth:
1, 0, 0, 0, 0, ...
0.5, 0.5, 0, 0, 0, ...
1/3, 1/3, 1/3, 0, 0, ...
1/4, 1/4, 1/4, 1/4, 0, ...
1/5, 1/5, 1/5, 1/5, 1/5, 0, ...
The same tag position could have completely different ground truth values: 1.0 when alone, 0.5 when together with another one, 0.1 with 10 of them, and so on. A fixed threshold couldn't tell which was the correct case.
Instead, we can check the descending sort of predicted values and the corresponding cumulative sum. As soon as that sum is above a certain number (let's say 0.95), that's the number of tags that we predict. Tweaking the exact threshold number for the cumulative sum would serve as a way to influence precision and recall.
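As a rough illustration, here is a minimal NumPy sketch of that decision rule (the 0.95 cutoff is an arbitrary choice you would tune on a validation set):

import numpy as np

def predict_tags(softmax_probs, cumulative_threshold=0.95):
    # Sort predicted probabilities in descending order and keep adding
    # candidates until their cumulative sum exceeds the threshold.
    order = np.argsort(softmax_probs)[::-1]
    cumulative = np.cumsum(softmax_probs[order])
    n_tags = int(np.searchsorted(cumulative, cumulative_threshold) + 1)
    return order[:n_tags]

probs = np.array([0.02, 0.50, 0.46, 0.02])   # softmax output for 4 classes
print(predict_tags(probs))                   # -> [1 2]  (two tags pass the 0.95 cutoff)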
I have a very simple 2 dimensional software simulation of a ball that can roll down a hill. (see image)
The ball starts at the top of the hill (at x-position 0 and y-position 50)
and can roll down to either the left or the right side of the hill. The ball's and hill's properties (mass, material, etc.) stay constant but are unknown. Also, the force that pushes the ball down the hill is unknown.
Moreover, there is wind that either accelerates or decelerates the ball. The wind speed is known and can change at every time step.
I collected 1000 trials (i.e. rolling the ball down the hill 1000 times).
For each trial, I sampled the ball's position and the wind speed every 10 milliseconds and saved the data into a csv file.
The file looks like this:
trial_nr,direction,time(ms),wind(km/h),ball_position
0, None, 0, 10, 0/50
0, right, 10, 10, 5/45
0, right, 20, 4, 8/30
0, right, 30, 6, 10/25
0, right, 40, 7, 15/15 <- stop position (label)
1, None, 0, 7, 0/50
1, right, 10, 8, 7/43
...
1, right, 50, 3, 20/15 <- stop position (label)
2, None, 0, 8, 0/50
2, left, 10, 1, -3/48
2, left, 20, 0, -4/47
...
2, left, 60, 3, -17/15 <- stop position (label)
...more trials...
What I want to do now is to predict the stop position of the ball for future trials.
Concretely, I want to train a machine learning model with this csv file.
When it is trained, it gets as input the direction/wind/ball_position of a trial for the first 30 ms, and based on this data it should predict an estimate of where the ball will stop after it has rolled down the hill. This is a supervised learning problem, since the label is the stop position of the ball.
I was thinking of using a neural network that gets the feature vectors in the csv file as input. But the problem is that not every feature vector has an associated label. Only the last feature vector of every trial has a label (the stop position). So how can I solve the given problem?
Any help is much appreciated.
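Not an answer to the modelling choice itself, but one common way to frame the data is to build one training example per trial: the features from the first 30 ms as input and the final stop position as the target. A minimal pandas sketch, assuming the CSV above is stored as trials.csv (a placeholder name) and the columns are named exactly as shown:

import pandas as pd

df = pd.read_csv("trials.csv")   # placeholder file name for the data described above

# split the "x/y" ball_position strings into two numeric columns
df[["ball_x", "ball_y"]] = df["ball_position"].str.split("/", expand=True).astype(float)

X_rows, y_rows = [], []
for trial_nr, g in df.groupby("trial_nr"):
    g = g.sort_values("time(ms)")
    early = g[g["time(ms)"] <= 30]                          # samples from the first 30 ms
    # flatten the early samples into one fixed-length feature vector;
    # the categorical direction column would need encoding (e.g. one-hot) before use
    features = early[["wind(km/h)", "ball_x", "ball_y"]].to_numpy().ravel()
    target = g.iloc[-1][["ball_x", "ball_y"]].to_numpy()    # stop position of the trial
    X_rows.append(features)
    y_rows.append(target)

# X_rows / y_rows can now be fed to any regressor (e.g. a small neural network),
# so every input vector has exactly one label: the trial's stop position.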
I am new to deep learning and I am attempting to understand how a CNN performs image classification.
I have gone through multiple YouTube videos, blogs and papers, and they all mention roughly the same thing:
add filters to get feature maps
perform pooling
introduce non-linearity using ReLU
send to a fully connected network.
While this is all fine and dandy, I don't really understand how convolution works in essence. Take edge detection, for example.
For instance, [[-1, 1], [-1, 1]] detects a vertical edge.
How? Why? How do we know for sure that this will detect a vertical edge?
Similarly for the matrices used for blurring/sharpening: how do we actually know that they will do what they are intended to do?
Do I simply take people's word for it?
Please help; I feel helpless since I am not able to understand convolution and how these matrices detect edges or shapes.
Filters detect spatial patterns such as edges in an image by detecting the changes in intensity values of the image.
A quick recap: in terms of an image, a high-frequency image is one where the intensity of the pixels changes by a large amount, while a low-frequency image is one where the intensity is almost uniform. An image has both high- and low-frequency components. The high-frequency components correspond to the edges of an object, because at the edges the rate of change of the pixel intensity values is high.
High pass filters are used to enhance the high-frequency parts of an image.
Let's take an example where a part of your image has the pixel values [[10, 10, 0], [10, 10, 0], [10, 10, 0]], indicating that the pixel values decrease toward the right, i.e. the image changes from light on the left to dark on the right. The filter used here is [[1, 0, -1], [1, 0, -1], [1, 0, -1]].
Now, we convolve these two matrices, which gives the element-wise products [[10, 0, 0], [10, 0, 0], [10, 0, 0]]. Finally, these values are summed up to give an output value of 30, which captures the variation in pixel values as we move from left to right. Similarly, we compute the subsequent output values.
Here, you will notice that the rate of change of the pixel values varies a lot from left to right, so a vertical edge has been detected. Had you used the filter [[1, 1, 1], [0, 0, 0], [-1, -1, -1]], you would get a convolution output consisting of 0s only, i.e. no horizontal edge is present. In a similar way, [[-1, 1], [-1, 1]] detects a vertical edge.
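To make this concrete, here is a small NumPy sketch that slides the 3x3 vertical-edge filter over a light-to-dark image and sums the element-wise products at each position, exactly as in the worked example (this is cross-correlation, which is what deep-learning libraries actually compute as "convolution"):

import numpy as np

image = np.array([
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
], dtype=float)

vertical_edge_filter = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

def cross_correlate(img, kernel):
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise product of the window and the filter, then sum
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

print(cross_correlate(image, vertical_edge_filter))
# [[ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]]
# Large responses appear only where the bright-to-dark transition (the vertical
# edge) sits under the filter; flat regions give 0.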
You can check more here in a lecture by Andrew Ng.
Edit: Usually, a vertical edge detection filter has bright pixels on the left and dark pixels on the right (or vice versa). The sum of the values of the filter should be 0, otherwise the resulting image will become brighter or darker overall. Also, in convolutional neural networks the filters are not hand-designed; their weights are learned, like the other parameters, through backpropagation during the training process.
I am reading through Residual learning, and I have a question.
What is "linear projection" mentioned in 3.2? Looks pretty simple once got this but could not get the idea...
Can someone provide simple example?
First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.
x is the input data (called a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv+relu+batchnorm in this paper), and y adds the two together (forming the output of the block). The result of F is also of rank 4, and most of its dimensions are the same as those of x, except for one. That's exactly what the transformation has to patch up.
For example, the shape of x might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. F(x) might be (64, 32, 32, 16): the batch size never changes and, for simplicity, the ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters - 16.
So, in order for y=F(x)+x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).
I'd like to stress that "reshaping" here is not what numpy.reshape does.
Instead, the last dimension of x (the channels) is padded with 13 zeros, like this:
pad(x=[1, 2, 3],padding=[7, 6]) = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we start to think that our vector is the same, but there are 13 more dimensions out there. None of the other x dimensions are changed.
Here's the link to the code in Tensorflow that does this.
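As a rough NumPy illustration of that zero-padding "projection" (a sketch only, not the linked TensorFlow code):

import numpy as np

x = np.zeros((64, 32, 32, 3))             # (batch, height, width, channels)

# pad only the last (channel) dimension: 7 zero channels in front, 6 after,
# matching the pad(x=[1, 2, 3], padding=[7, 6]) example above
x_projected = np.pad(x, ((0, 0), (0, 0), (0, 0), (7, 6)))

print(x_projected.shape)                  # (64, 32, 32, 16) -> can now be added to F(x)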
A linear projection is one where each new feature is simply a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. If x is the vector of N input features and W is an M-by-N matrix, then the matrix product Wx yields M new features, where each one is a linear projection of x. Each row of W is a set of weights that defines one of the M linear projections (i.e., each row of W contains the coefficients for one of the weighted sums of x).
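For instance, a minimal NumPy sketch of such a projection (the numbers are arbitrary):

import numpy as np

x = np.array([1.0, 2.0, 3.0])        # N = 3 original features
W = np.random.randn(16, 3)           # M-by-N weight matrix, here M = 16

y = W @ x                            # 16 new features, each a weighted sum of x
print(y.shape)                       # (16,)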
In PyTorch (in particular torchvision\models\resnet.py), at the end of a Bottleneck you will have one of two scenarios:
The number of channels of the input vector x, say x_c (not the spatial resolution, but the channels), does not match the number of channels after the conv3 layer of the Bottleneck, say d. This is handled by a 1x1 convolution with in_planes = x_c and out_planes = d, with stride 1, followed by batch normalization; the addition F(x) + x can then take place, assuming x and F(x) have the same spatial resolution.
Neither the spatial resolution of x nor its number of channels matches the output of the Bottleneck layer, in which case the 1x1 convolution mentioned above needs to have stride 2 so that both the spatial resolution and the number of channels match for the element-wise addition (again with batch normalization of x before the addition).
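A minimal PyTorch sketch of that downsample shortcut (simplified relative to what torchvision actually builds; the channel counts are just examples):

import torch
from torch import nn

x = torch.randn(1, 64, 56, 56)        # (batch, channels, height, width)

# 1x1 convolution projecting 64 -> 256 channels; use stride=2 instead
# when the spatial resolution must be halved as well (second scenario above)
downsample = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=1, stride=1, bias=False),
    nn.BatchNorm2d(256),
)

shortcut = downsample(x)              # shape (1, 256, 56, 56), ready for F(x) + shortcut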
I want to shear an image along the y-axis, but keep a given x position at the same height.
If I use a shear matrix as:
[1, 0, 0]
[s, 1, 0]
with s being the shear factor, with the function "warpAffine", I do get my sheared image, but the origin is always at (0, 0).
I need to have my origin at another x (and maybe even y).
So I think I should combine translation matrices, to translate the image to the left, perform the shear and translate the image back to the right, but I am unsure how to combine these matrices, since:
[1, 0, dx] [1, 0, 0] [1, 0, -dx]
[0, 1, dy] * [s, 1, 0] * [0, 1, -dy]
obviously won't work.
How can I perform a shearing operation about a chosen origin other than (0, 0)?
You are correct that you should combine those transforms. To combine them, convert all of them to 3x3 matrices by adding [0, 0, 1] as the third row. Then you can multiply all three matrices. When you are done, the bottom row will still be [0, 0, 1]; remove it and use the resulting 2x3 matrix as your combined transform.
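For example, a minimal sketch with NumPy and OpenCV (cx, cy are the chosen origin, s is the shear factor; the file name is a placeholder):

import cv2
import numpy as np

img = cv2.imread("input.png")          # placeholder file name
h, w = img.shape[:2]
s, cx, cy = 0.3, 200, 0                # shear factor and chosen origin

T_back = np.array([[1, 0, cx], [0, 1, cy], [0, 0, 1]], dtype=np.float32)    # translate back
shear  = np.array([[1, 0, 0],  [s, 1, 0], [0, 0, 1]], dtype=np.float32)     # shear along y
T_to   = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]], dtype=np.float32)  # move origin to (0, 0)

M = T_back @ shear @ T_to              # combined 3x3 transform; bottom row stays [0, 0, 1]
warped = cv2.warpAffine(img, M[:2], (w, h))   # drop the last row -> 2x3 matrix for warpAffine

With this composition the column x = cx keeps its height, since it is moved to the origin before the shear is applied and moved back afterwards.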