This is the architecture of YOLO. I am trying to calculate the output size of each layer myself, but I can't get the size as described in the paper.
For example, in the first Conv layer the input size is 448x448 and it uses a 7x7 filter with stride 2. According to the equation W2 = (W1 - F + 2P)/S + 1 = (448 - 7 + 0)/2 + 1, I can't get an integer result, so the filter size seems unsuitable for the input size.
Can anyone explain this problem? Did I miss something or misunderstand the YOLO architecture?
As Hawx Won said, 3 extra pixels of padding are added to the input image, and here is how that works in the source code.
For convolutional layers, if pad is enabled, the padding of each layer is calculated by:
// In parser.c
if(pad) padding = size/2;

// In convolutional_layer.c
l.pad = padding;
where size is the filter size.
So, for the first layer: padding = size/2 = 7/2 = 3 (integer division).
Then the output of first convolutional layer should be:
output_w = (input_w+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224
output_h = (input_h+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224
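As a quick sanity check, here is a minimal Python sketch of that formula (the helper name conv_output is just for illustration):

def conv_output(w, size, stride, pad):
    # (W + 2P - F) / S + 1 with floor division, matching the integer arithmetic in the C source
    return (w + 2 * pad - size) // stride + 1

# first YOLO conv layer: 448x448 input, 7x7 filter, stride 2, pad = 7 // 2 = 3
print(conv_output(448, size=7, stride=2, pad=7 // 2))  # -> 224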
Well, I spent some time reading the source code and learned that the input image gets 3 extra pixels of padding on the top, bottom, left and right sides, so the image size becomes 448 + 2*3 = 454. The output size of this valid-padding convolution is then calculated as:
output_size = ceil((W - F + 1)/S) = ceil((454 - 7 + 1)/2) = 224, therefore the output size should be 224x224x64.
I hope this is helpful.
I have a problem where all images contain the same object; however, these objects can have either number_of_colors <= 3 or number_of_colors > 3, and the images are not labeled.
My attempt starts by converting RGB to LAB and considering only the A and B channels to find the color coverage of each image. I think of the coverage as an area in the AB plane: for every image I find the range of A and B (i.e. max(A) - min(A) and max(B) - min(B)) and multiply them together to get the area, assuming it is a rectangle. Then I threshold on this feature.
Please let me know whether this intuition is correct and why it isn't working. Here is the confusion matrix:
TP: 0.41935483871, FN: 0.0645161290323
FP: 0.0967741935484, TN: 0.41935483871
Here is the basic routine that should work per image:
import numpy as np

# rgb_to_lab is my RGB -> LAB conversion helper
LAB = rgb_to_lab(data_rgb[..., 0], data_rgb[..., 1], data_rgb[..., 2])
A = LAB[1]
B = LAB[2]
minA, maxA = np.min(A), np.max(A)
minB, maxB = np.min(B), np.max(B)
diff_A = maxA - minA
diff_B = maxB - minB
area = diff_A * diff_B
# p is the normalized area
p = area / (200 * 200.)
# Found 0.53 to be the best possible threshold
if p > 0.53:
    print('class 1 - colors > 3')
else:
    print('class 2 - colors <= 3')
Edit 1:
Here is an image to show how the threshold separates positive and negative images
Edit 2:
This shows the same plot but only considering A and B values at pixels with luminance between 16 and 70, which seems to widen the separation and reduces the FP count by 1. A sketch of that filtering step is shown below.
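For reference, a minimal sketch of how that luminance filter could be applied before computing the area, assuming the same LAB array as in the routine above (the exact inclusiveness of the 16/70 bounds is my own choice here):

L_chan, A, B = LAB[0], LAB[1], LAB[2]
# keep only pixels whose luminance falls in the chosen band
mask = (L_chan >= 16) & (L_chan <= 70)
A, B = A[mask], B[mask]
area = (np.max(A) - np.min(A)) * (np.max(B) - np.min(B))
p = area / (200 * 200.)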
In this tutorial, the output volumes are stated in output [25], and the receptive fields are specified in output [26].
Okay, the input volume [3, 227, 227] gets convolved with the region of size [3, 11, 11].
Using this formula (W−F+2P)/S+1, where:
W = the input volume size
F = the receptive field size
P = padding
S = stride
...results in (227 - 11)/4 + 1 = 55, i.e. [96, 55, 55]. So far so good :)
For 'pool1' they used F=3 and S=2, I think? The calculation checks out: (55 - 3)/2 + 1 = 27.
From this point I get a bit confused. The receptive field for the second convnet layer is [48, 5, 5], yet the output for 'conv2' is equal to [256, 27, 27]. What calculation happened here?
And then, the height and width of the output volumes of 'conv3' to 'conv4' are all the same [13, 13]? What's going on?
Thanks!
If you look closely at the parameters of the conv2 layer you'll notice
pad: 2
That is, the input blob is padded by 2 extra pixels all around, so with F=5, P=2 and S=1 the formula gives
(27 - 5 + 2*2)/1 + 1 = 27
Padding a kernel of size 5 with 2 pixels on each side preserves the spatial size.
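As a quick check, here is a minimal Python sketch that runs the same formula over the layers discussed above (the helper name out_size is just for illustration; the kernel, stride and pad values are the ones quoted in the question and answer):

def out_size(w, f, s, p=0):
    # (W - F + 2P) / S + 1, using integer division
    return (w - f + 2 * p) // s + 1

w = out_size(227, f=11, s=4)    # conv1 -> 55
w = out_size(w, f=3, s=2)       # pool1 -> 27
w = out_size(w, f=5, s=1, p=2)  # conv2 -> 27
print(w)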
Please tell me what values I should set for D3DRS_POINTSCALE_A, D3DRS_POINTSCALE_B and D3DRS_POINTSCALE_C so that point sprites are scaled just like the other objects in the scene. The parameters A = 0, B = 0 and C = 1 (proposed by F. D. Luna) are not suitable because the scale is not accurate enough, and the distance between the particles (point sprites) can be greater than it should be. If I replace the point sprites with billboards, the scale of the particles is correct, but rendering is much slower. Please help: the speed of rendering the particles is very important for my task, but the precision of their scale is very important too.
Direct3D computes the screen-space point size according to the formula documented in MSDN - Point Sprites, and I cannot understand what values A, B and C should be set to for the scaling to be correct.
P.S. Sorry for my English, I'm from Russia.
DirectX uses this function to determine the scaled size of a point:
out_scale = viewport_height * in_scale * sqrt( 1 / ( A + B * eye_distance + C * (eye_distance^2) ) )
where eye_distance is computed from the eye-space position:
eye_distance = sqrt(X^2 + Y^2 + Z^2)
So to answer your question:
D3DRS_POINTSCALE_A is the constant term,
D3DRS_POINTSCALE_B is the linear term (scales eye_distance), and
D3DRS_POINTSCALE_C is the quadratic term (scales eye_distance squared).
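To make the formula concrete, here is a small Python sketch of the same computation (viewport_height, in_scale and the eye-space position are made-up example values, not anything prescribed by Direct3D):

import math

def point_screen_size(viewport_height, in_scale, eye_distance, A, B, C):
    # screen size = Vh * Sz * sqrt(1 / (A + B*De + C*De^2))
    return viewport_height * in_scale * math.sqrt(1.0 / (A + B * eye_distance + C * eye_distance ** 2))

eye_distance = math.sqrt(1.0**2 + 2.0**2 + 10.0**2)  # eye-space position (1, 2, 10)
print(point_screen_size(720, 0.5, eye_distance, A=0.0, B=0.0, C=1.0))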
While implementing a sliding window in Python to detect objects in still images, I came across the nice function:
numpy.lib.stride_tricks.as_strided
So I tried to work out a general rule to avoid the mistakes I might make when changing the size of the sliding windows I need. Finally I got this representation:
all_windows = as_strided(x, ((x.shape[0] - xsize)/xstep, (x.shape[1] - ysize)/ystep, xsize, ysize), (x.strides[0]*xstep, x.strides[1]*ystep, x.strides[0], x.strides[1]))
which results in a 4-dimensional array. The first two dimensions represent the number of windows on the x and y axes of the image, and the other two represent the size of the window (xsize, ysize).
The step represents the displacement between two consecutive windows.
This representation works fine if I choose square sliding windows, but I still have a problem getting it to work for windows of e.g. (128, 64), where I usually get data unrelated to the image.
What is wrong with my code? Any ideas? And is there a better way to get sliding windows nice and neat in Python for image processing?
Thanks
There is an issue in your code. Actually this approach works fine in 2D and there is no reason to use the multi-dimensional version (see Using strides for an efficient moving average filter). Below is a fixed version:
import numpy as np
from numpy.lib.stride_tricks import as_strided

xsize, ysize, xstep, ystep = 4, 4, 1, 1  # example window shape and step
A = np.arange(100).reshape((10, 10))
print(A)
all_windows = as_strided(A, ((A.shape[0] - xsize + 1) // xstep, (A.shape[1] - ysize + 1) // ystep, xsize, ysize),
                         (A.strides[0] * xstep, A.strides[1] * ystep, A.strides[0], A.strides[1]))
print(all_windows)
Check out the answers to this question: Using strides for an efficient moving average filter. Basically strides are not a great option, although they work.
For posterity:
This is implemented in scikit-learn in the function sklearn.feature_extraction.image.extract_patches.
I had a similar use case where I needed to create sliding windows over a batch of multi-channel images and ended up with the function below. I've written a more in-depth blog post covering this in the context of manually creating a convolution layer. The function implements the sliding windows and also supports dilating or padding the input array.
The function takes as input:
input - array of shape (Batch, Channel, Height, Width)
output_size - Depends on usage, comments below.
kernel_size - size of the sliding window you wish to create (square)
padding - amount of 0-padding added to the outside of the (H,W) dimensions
stride - stride the sliding window should take over the inputs
dilate - amount to spread the cells of the input. This adds 0-filled rows/cols between elements
Typically, when performing forward convolution, you do not need to perform dilation, so your output size can be found by using the following formula (replace x with the input dimension):
(x - kernel_size + 2 * padding) // stride + 1
When performing the backwards pass of convolution with this function, use a stride of 1 and set output_size to the size of your forward pass's x input.
Sample code with an example of using this function can be found at this link.
import numpy as np

def getWindows(input, output_size, kernel_size, padding=0, stride=1, dilate=0):
    working_input = input
    working_pad = padding

    # dilate the input if necessary
    if dilate != 0:
        working_input = np.insert(working_input, range(1, input.shape[2]), 0, axis=2)
        working_input = np.insert(working_input, range(1, input.shape[3]), 0, axis=3)

    # pad the input if necessary
    if working_pad != 0:
        working_input = np.pad(working_input, pad_width=((0,), (0,), (working_pad,), (working_pad,)), mode='constant', constant_values=(0.,))

    in_b, in_c, out_h, out_w = output_size
    out_b, out_c, _, _ = input.shape
    batch_str, channel_str, kern_h_str, kern_w_str = working_input.strides

    return np.lib.stride_tricks.as_strided(
        working_input,
        (out_b, out_c, out_h, out_w, kernel_size, kernel_size),
        (batch_str, channel_str, stride * kern_h_str, stride * kern_w_str, kern_h_str, kern_w_str)
    )
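As a quick usage sketch (the shapes and hyperparameters below are arbitrary example values, and out_h/out_w are computed with the forward-pass formula above):

# batch of 2 images, 3 channels, 8x8 pixels
x = np.random.rand(2, 3, 8, 8)
kernel_size, padding, stride = 3, 1, 1
out_h = (x.shape[2] - kernel_size + 2 * padding) // stride + 1
out_w = (x.shape[3] - kernel_size + 2 * padding) // stride + 1
windows = getWindows(x, (x.shape[0], x.shape[1], out_h, out_w), kernel_size, padding=padding, stride=stride)
print(windows.shape)  # (2, 3, 8, 8, 3, 3)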