How to calculate the output size of a convolutional layer in YOLO? - machine-learning

This is the architecture of YOLO. I am trying to calculate the output size of each layer myself, but I can't get the size as described in the paper.
For example, the first Conv Layer takes a 448x448 input and uses a 7x7 filter with stride 2. According to the equation W2 = (W1 − F + 2P)/S + 1 = (448 − 7 + 0)/2 + 1, I don't get an integer result, so the filter size seems unsuitable for the input size.
Can anyone explain this? Did I miss something, or have I misunderstood the YOLO architecture?

As Hawx Won said, the input image gets an extra padding of 3 on each side; here is how it works in the source code.
For convolutional layers, if pad is enabled, the padding of each layer is calculated by:
// In parser.c
if(pad) padding = size/2;

// In convolutional_layer.c
l.pad = padding;
where size is the filter size.
So, for the first layer: padding = size/2 = 7/2 = 3 (integer division).
Then the output of first convolutional layer should be:
output_w = (input_w + 2*pad - size)/stride + 1 = (448 + 6 - 7)/2 + 1 = 224
output_h = (input_h + 2*pad - size)/stride + 1 = (448 + 6 - 7)/2 + 1 = 224
(the division here is integer division, so 447/2 = 223, and 223 + 1 = 224).
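As a quick sanity check, here is a minimal Python sketch (not from Darknet itself) that reproduces this calculation:

def conv_output_size(w, size, stride, pad_enabled=True):
    # Darknet-style padding: size // 2 when pad is enabled, 0 otherwise
    pad = size // 2 if pad_enabled else 0
    return (w + 2 * pad - size) // stride + 1

# First YOLO conv layer: 448x448 input, 7x7 filter, stride 2
print(conv_output_size(448, 7, 2))  # 224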

Well, I spent some time reading the source code and learned that the input image gets an extra padding of 3 on the top, bottom, left and right sides, so the image size becomes 448 + 2*3 = 454. The output size with valid padding should then be calculated as:
output_size = ceil((W - F + 1)/S) = ceil((454 - 7 + 1)/2) = 224, so the output size is 224x224x64.
I hope this is helpful.

Related

Placing a shape inside another shape using opencv

I have two images and I need to place the second image inside the first image. The second image can be resized, rotated or skewed so that it covers as large an area of the other image as possible. As an example, in the figure shown below, the green circle needs to be placed inside the blue shape:
Here the green circle is transformed such that it covers a larger area. Another example is shown below:
Note that there may be multiple possible results; any similar result is acceptable, as shown in the example above.
How do I solve this problem?
Thanks in advance!
I tested the idea I mentioned earlier in the comments and the output is almost good. It could be better, but that takes time. The final code is long and depends on one of my old personal projects, so I won't share it, but I will explain step by step how I wrote the algorithm. Note that I have tested the algorithm many times; it is not yet 100% accurate.
For N iterations, do the following:
1. Copy the shape.
2. Transform it randomly.
3. Put the shape on the background.
4. If the shape exceeds the background, it is not acceptable; go back to step 1. Otherwise, continue to step 5.
5. Calculate the width, height and number of pixels of the shape.
6. Keep a list of the best candidates and compare these three parameters (W, H, Pixels) with the members of the list. If a better candidate is found, save it.
I set the value of N to 5,000. The larger the number, the slower the algorithm runs, but the better the result.
You can use anything for the transform: mirror, rotate, shear, scale, resize, etc. I used warpPerspective for this one.
import sys
import random
import cv2
import numpy as np

im1 = cv2.imread(sys.path[0]+'/Back.png')
im2 = cv2.imread(sys.path[0]+'/Shape.png')
bH, bW = im1.shape[:2]
sH, sW = im2.shape[:2]

# TopLeft, TopRight, BottomRight, BottomLeft of the shape
_inp = np.float32([[0, 0], [sW, 0], [sW, sH], [0, sH]])
cx = random.randint(5, sW-5)
ch = random.randint(5, sH-5)
o = 0

# Random transformed output
_out = np.float32([
    [random.randint(-o, cx-1), random.randint(1-o, ch-1)],
    [random.randint(cx+1, sW+o), random.randint(1-o, ch-1)],
    [random.randint(cx+1, sW+o), random.randint(ch+1, sH+o)],
    [random.randint(-o, cx-1), random.randint(ch+1, sH+o)]
])

# Transformed output (dsize for warpPerspective is (width, height))
M = cv2.getPerspectiveTransform(_inp, _out)
t = cv2.warpPerspective(im2, M, (bW, bH))
You can use countNonZero to find the number of pixels and findContours and boundingRect to find the shape size.
def getSize(msk):
    cnts, _ = cv2.findContours(msk, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    cnts.sort(key=lambda p: max(cv2.boundingRect(p)[2], cv2.boundingRect(p)[3]), reverse=True)
    w, h = 0, 0
    if len(cnts) > 0:
        _, _, w, h = cv2.boundingRect(cnts[0])
    pix = cv2.countNonZero(msk)
    return pix, w, h
To find the overlap between the background and the shape, you can do something like this:
Make a mask from the background and one from the shape, and use bitwise methods. Change this section according to the software you wrote; this is just an example :)
mskMix = cv2.bitwise_and(mskBack, mskShape)
mskMix = cv2.bitwise_xor(mskMix, mskShape)
isCandidate = not np.any(mskMix == 255)
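Putting the pieces together, the main search loop might look roughly like this minimal sketch (not the exact code from my project): mskBack and the getSize helper come from the snippets above, N is the iteration count mentioned earlier, and random_transform is a hypothetical helper that wraps the warpPerspective step; the candidate comparison is simplified here to pixel count only.

best = None  # (pix, w, h, transformed shape)
for i in range(N):
    # Steps 1-3: copy the shape and transform it randomly
    t = random_transform(im2)  # hypothetical helper wrapping the warpPerspective step above
    mskShape = cv2.threshold(cv2.cvtColor(t, cv2.COLOR_BGR2GRAY), 10, 255, cv2.THRESH_BINARY)[1]

    # Step 4: reject candidates that exceed the background
    mskMix = cv2.bitwise_and(mskBack, mskShape)
    mskMix = cv2.bitwise_xor(mskMix, mskShape)
    if np.any(mskMix == 255):
        continue

    # Steps 5-6: keep the best candidate seen so far
    pix, w, h = getSize(mskShape)
    if best is None or pix > best[0]:
        best = (pix, w, h, t.copy())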
For example, this is not a candidate answer, because if you look closely at the image on the right, you will notice that the shape exceeds the background.
I just tested the circle with 4 different backgrounds, and here are the results:
After 4879 Iterations:
After 1587 Iterations:
After 4621 Iterations:
After 4574 Iterations:
A few additional points: if you use a method like medianBlur to reduce noise in the background mask and the shape mask, you may get a better result.
I suggest you read about Evolutionary Computation, Metaheuristic and Soft Computing algorithms for better understanding of this algorithm :)

Detect handwritten characters bounded by a box with OpenCV

I am trying to read a handwritten form which has boxed-input.
I have run tesseract on the image but get strange results. In my understanding, the best thing to do is to detect the bounding box and subtract it from the image. What's the best way to detect the box (the semi-box around each character)? I tried cv2.HoughLines(), but with no result.
I am new to OpenCV. It will be really helpful if someone can help me out here.
Thanks for your idea. I just realized that I can probably count the dark pixels in each column and check whether the count exceeds a certain threshold:
import numpy as np

def get_pixel_count_in_col(img, col):
    # count dark (non-white) pixels in one column
    count = 0
    for j in range(img.shape[0]):
        if img[j, col] < 255:
            count = count + 1
    return count

def cleanup_img(img):
    # collect columns that look like vertical box borders, then drop them
    foundlines = []
    for i in range(img.shape[1]):
        if get_pixel_count_in_col(img, i) > img.shape[0]*0.7:
            foundlines.append(i)
            if i > 0 and get_pixel_count_in_col(img, i-1) > img.shape[0]*0.25:
                foundlines.append(i-1)
            if i+1 < img.shape[1] and get_pixel_count_in_col(img, i+1) > img.shape[0]*0.25:
                foundlines.append(i+1)
    return np.delete(img, foundlines, 1)
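For reference, a minimal way to run this might look like the following (the file name is just a placeholder; the image is read as grayscale so that dark pixels are below 255):

import cv2

img = cv2.imread('boxed_input.png', 0)  # placeholder path, read as grayscale
cleaned = cleanup_img(img)              # drops columns that look like vertical box borders
cv2.imwrite('cleaned.png', cleaned)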
The resulting image makes more sense. But is there any other, easier way to do this?
It seems that your input format is quite clean and consistent, so you can simply hard-code the width of each box in pixels and crop out the characters. If the input format is not fixed, this answer can be extended to handle that as well (it would be a bit more expensive); as a first attempt we simply hard-code the box width in pixels.
import cv2

def get_image_chunks(img, size):
    chunks = []
    # To remove black borders
    padding = 2
    for i in range(0, img.shape[1], size):
        col_start = i + padding
        col_end = i + size - padding
        # Slicing the numpy array.
        chunks.append(img[:-padding, col_start:col_end])
    return chunks

img = cv2.imread("/Users/anmoluppal/Downloads/GLUmJ.jpg", 0)
chunks = get_image_chunks(img, 42)
Outputs: (the cropped character images are shown in the original answer)
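If the goal is still OCR, each chunk can then be fed to tesseract on its own; a rough sketch, assuming pytesseract is installed and that single-character mode (--psm 10) suits this input:

import pytesseract

for idx, chunk in enumerate(chunks):
    # --psm 10 tells tesseract to treat the image as a single character
    text = pytesseract.image_to_string(chunk, config='--psm 10')
    print(idx, text.strip())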

OpenImaj: EigenImages, input images dimension cannot be difference?

VFSGroupDataset<FImage> dataset = new VFSGroupDataset<FImage>(
        "zip:file:/Users/nhnguyen/Data/newArchive.zip",
        ImageUtilities.FIMAGE_READER);

int nTraining = 50;
int nTesting = 5;
GroupedRandomSplitter<String, FImage> splits =
        new GroupedRandomSplitter<String, FImage>(dataset, nTraining, 0, nTesting);
GroupedDataset<String, ListDataset<FImage>, FImage> training = splits.getTrainingDataset();
GroupedDataset<String, ListDataset<FImage>, FImage> testing = splits.getTestDataset();

List<FImage> basisImages = DatasetAdaptors.asList(training);
int nEigenvectors = 100;
EigenImages eigen = new EigenImages(nEigenvectors);
eigen.train(basisImages);
I have the above code to test the EigenImages tutorial with my own set of data. What I am stuck on is that it throws a Matrix exception if the images in my data set vary in dimension, say 92x112 and 100x100 and so on... When I do a batch resize to the same size it works; however, this distorts the images a little, which I worry will affect the accuracy.
Is there a way to train the Eigen recognizer to accept input of various dimensions?
No, the Eigenfaces approach inherently requires that all images are the same size and are also at least approximately aligned (i.e. same orientation, eyes in about the same place).
You might however be able to automate the scaling and alignment by using one of the OpenIMAJ FaceAligner implementations.

How to merge two image without losing intensity in opencv

I have two images in opencv: Image A and Image B.
Image A is output frame from camera.
Image B is an alpha-transparent image obtained by masking another image.
Before masking, Image B is warped with cvWarpPerspective().
I tried cvAddWeighted() - it loses intensity when you give alpha and beta values.
I tried aishack - even here you lose the overall intensity of the output image.
I tried silveiraneto.net - Not helpful in my case
Please help me out with something where I don't lose intensity in the output image after blending.
Thanks in advance
When you say that you lose intensity, you leave open the question of how you lose it.
Do you lose intensity in the sense:
That when you add the images you hit the maximum intensity, and the rest is discarded.
(Example for 8-bit pixel addition: Pix1 = 200, Pix2 = 150, so Pix1 + Pix2 = 350, but the maximum value is 255, so Pix1 + Pix2 = 255.)
That the former values of image A are compromised by averaging with image B, which only covers some parts of the image.
(Example for an 8-bit image: Pix1 = 200, Pix2 = 150, (Pix1 + Pix2)/2 = 175; but where the corresponding pixel of the second image is zero, Pix2 = 0, so (Pix1 + Pix2)/2 = 100, which is half the value of the original image.)
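As a quick illustrative check (not part of the original answer), this second effect is exactly what an addWeighted blend with alpha = beta = 0.5 produces wherever the overlay pixel is black, shown here with the Python cv2 API:

import numpy as np
import cv2

a = np.full((1, 1), 200, dtype=np.uint8)  # pixel from image A
b = np.zeros((1, 1), dtype=np.uint8)      # overlay pixel is black (0) here

# The weighted average halves the original intensity where the overlay is empty
print(cv2.addWeighted(a, 0.5, b, 0.5, 0))  # [[100]]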
One of these observations should tell you what you need to do.
I don't know offhand which of these approaches the functions you mentioned use.
I finally got the answer. It consists of 5 steps.
Step - 1
cvGetPerspectiveTransform(q, pnt, warp_matrix);
// pnt holds the four corner x and y coordinates; warp_matrix is a 3 x 3 matrix
Step - 2
cvWarpPerspective(dst2, neg_img, warp_matrix, CV_INTER_LINEAR);
// dst2 is the overlay image, neg_img is a blank image
Step - 3
cvSmooth(neg_img, neg_img, CV_MEDIAN); // smooth the image
Step - 4
cvThreshold(neg_img, cpy_img, 0, 255, CV_THRESH_BINARY_INV);
// cpy_img is an image created from image_n
Step - 5
cvAnd(cpy_img, image_n, cpy_img); // image_n is the input image
cvOr(neg_img, cpy_img, image_n);
Output - image_n (without losing the intensity of the input image)
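For reference, a minimal sketch of the same mask-based compositing with the Python cv2 API might look like this (not the original code; the file names are placeholders, and the overlay is assumed to already be warped onto a black background of the frame size):

import cv2

frame = cv2.imread('frame.png')             # input camera image (image_n above), placeholder path
overlay = cv2.imread('warped_overlay.png')  # warped overlay on a black background (neg_img above)

overlay = cv2.medianBlur(overlay, 5)

# Inverse mask: 255 where the overlay is empty (black), 0 where it has content
gray = cv2.cvtColor(overlay, cv2.COLOR_BGR2GRAY)
_, inv_mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV)

# Keep the frame only where the overlay is empty, then OR the overlay back in
background = cv2.bitwise_and(frame, frame, mask=inv_mask)
result = cv2.bitwise_or(background, overlay)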

Sliding window using as_strided function in numpy?

While implementing a sliding window in Python to detect objects in still images, I came across the nice function:
numpy.lib.stride_tricks.as_strided
So I tried to work out a general rule to avoid mistakes when changing the size of the sliding windows I need. Finally I arrived at this representation:
all_windows = as_strided(x, ((x.shape[0] - xsize)/xstep, (x.shape[1] - ysize)/ystep, xsize, ysize), (x.strides[0]*xstep, x.strides[1]*ystep, x.strides[0], x.strides[1]))
which results in a 4-dimensional array. The first two dimensions represent the number of windows along the x and y axes of the image, the other two represent the window size (xsize, ysize), and the step is the displacement between two consecutive windows.
This representation works fine if I choose square sliding windows, but I still have a problem getting it to work for windows of e.g. (128, 64), where I usually get data unrelated to the image.
What is wrong with my code? Any ideas? Is there a better way to get sliding windows nice and neat in Python for image processing?
Thanks
There is an issue in your code. Actually, this approach works well in 2D and there is no reason to use the multi-dimensional version (see: Using strides for an efficient moving average filter). Below is a fixed version:
import numpy as np
from numpy.lib.stride_tricks import as_strided

xsize, ysize, xstep, ystep = 4, 4, 2, 2  # example window size and step (assumed values)
A = np.arange(100).reshape((10, 10))
print(A)
all_windows = as_strided(A, ((A.shape[0] - xsize + 1) // xstep, (A.shape[1] - ysize + 1) // ystep, xsize, ysize),
                         (A.strides[0] * xstep, A.strides[1] * ystep, A.strides[0], A.strides[1]))
print(all_windows)
Check out the answers to this question: Using strides for an efficient moving average filter. Basically strides are not a great option, although they work.
For posterity:
This is implemented in scikit-learn in the function sklearn.feature_extraction.image.extract_patches.
I had a similar use case where I needed to create sliding windows over a batch of multi-channel images and ended up with the function below. I've written a more in-depth blog post covering this with regard to manually creating a convolution layer. This function implements the sliding windows and also supports dilating or padding the input array.
The function takes as input:
input - Size of (Batch, Channel, Height, Width)
output_size - Depends on usage, comments below.
kernel_size - size of the sliding window you wish to create (square)
padding - amount of 0-padding added to the outside of the (H,W) dimensions
stride - stride the sliding window should take over the inputs
dilate - amount to spread the cells of the input. This adds 0-filled rows/cols between elements
Typically, when performing forward convolution, you do not need to perform dilation, so your output size can be found by using the following formula (replace x with the input dimension):
(x - kernel_size + 2 * padding) // stride + 1
When performing the backward pass of the convolution with this function, use a stride of 1 and set output_size to the size of your forward pass's x input.
Sample code with an example of using this function can be found at this link.
import numpy as np

def getWindows(input, output_size, kernel_size, padding=0, stride=1, dilate=0):
    working_input = input
    working_pad = padding
    # dilate the input if necessary
    if dilate != 0:
        working_input = np.insert(working_input, range(1, input.shape[2]), 0, axis=2)
        working_input = np.insert(working_input, range(1, input.shape[3]), 0, axis=3)
    # pad the input if necessary
    if working_pad != 0:
        working_input = np.pad(working_input, pad_width=((0,), (0,), (working_pad,), (working_pad,)), mode='constant', constant_values=(0.,))
    in_b, in_c, out_h, out_w = output_size
    out_b, out_c, _, _ = input.shape
    batch_str, channel_str, kern_h_str, kern_w_str = working_input.strides
    return np.lib.stride_tricks.as_strided(
        working_input,
        (out_b, out_c, out_h, out_w, kernel_size, kernel_size),
        (batch_str, channel_str, stride * kern_h_str, stride * kern_w_str, kern_h_str, kern_w_str)
    )
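The linked sample is not reproduced here, but a minimal smoke test (with arbitrary shapes, not from the original post) could look like this:

x = np.random.rand(2, 3, 8, 8)  # (batch, channels, height, width)
kernel_size, padding, stride = 3, 1, 2
out_h = (x.shape[2] - kernel_size + 2 * padding) // stride + 1
out_w = (x.shape[3] - kernel_size + 2 * padding) // stride + 1

windows = getWindows(x, (x.shape[0], x.shape[1], out_h, out_w),
                     kernel_size, padding=padding, stride=stride)
print(windows.shape)  # (2, 3, 4, 4, 3, 3)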
