How to label training data for YOLO

I have a question about how to label training data for the YOLO algorithm.
Let's say that for each label Y we need to specify [Pc, bx, by, bh, bw], where Pc is the presence indicator (1 = present, 0 = not present), (bx, by) is the relative position of the center of the object of interest, and (bh, bw) is the relative dimension of the bounding box containing the object.
Using the picture below as an example, cell (1,2), which contains a black car, should have the label Y = [1, 0.4, 0.3, 0.9, 0.5], and any cell without a car should have the label [0, ?, ?, ?, ?].
But what if we have a finer grid like this, where each cell is smaller than the ground-truth bounding box?
Let's say that the ground-truth bounding box for the car is the red box, and the ground-truth center point is the red dot, which falls in cell 2.
Cell 2 would then have the label Y = [1, 0.9, 0.1, 2, 2]. Is this correct?
And what labels should cells 1, 3, and 4 have? Do they get Pc=1 or Pc=0? If Pc=1, what would bx and by be? (As I recall, bx and by should take values between 0 and 1, but cells 1, 3, and 4 contain no center point of the object of interest.)

Indeed, the midpoint is contained in cell 2. Cells 1, 3, and 4 will have Pc=0, because the YOLO algorithm only assigns the object to the cell that contains the midpoint and calculates the bounding box, as you mentioned, with bx, by, bh, bw. Regarding your proposal of Y = [1, 0.9, 0.1, 2, 2]: if you take the point (0,0) as the upper-left corner, Y would more likely be Y = [1, 0.2, 0.57, 0.21, 0.15], calculated on a 19 x 19 grid as below.
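For concreteness, here is a minimal sketch (my own addition, with made-up numbers) of how the label of the responsible cell can be computed: bx and by are the midpoint's offsets inside that cell, so they stay between 0 and 1, while bh and bw are either relative to the cell (as in the question, where they can exceed 1) or relative to the whole image, depending on the convention used.
# Hypothetical illustration (values are made up): compute [Pc, bx, by, bh, bw]
# for the cell that contains the box midpoint on a 19 x 19 grid.
def yolo_label(x_c, y_c, box_w, box_h, grid=19, size_relative_to_cell=True):
    # x_c, y_c, box_w, box_h are normalized to the image, i.e. in [0, 1]
    col = int(x_c * grid)                    # cell column containing the midpoint
    row = int(y_c * grid)
    bx = x_c * grid - col                    # midpoint offset inside the cell, in [0, 1]
    by = y_c * grid - row
    if size_relative_to_cell:                # convention where bh, bw may exceed 1
        bh, bw = box_h * grid, box_w * grid
    else:                                    # alternative convention: relative to the image
        bh, bw = box_h, box_w
    return (row, col), [1, bx, by, bh, bw]   # every other cell gets Pc = 0

print(yolo_label(0.43, 0.55, 0.15, 0.21))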

How to find exact displacement of each pixel after using torch.nn.functional.grid_sample()

I have an input image and a grid that I pass to torch.nn.functional.grid_sample(). Now, if I have a random pixel location (x, y) in the input image, how can I find its location in the output of grid_sample()?
To be precise, I am looking for the delta of each pixel in terms of coordinates.
Would this be sufficient for finding the new location of a pixel:
ix = ((ix + 1) / 2) * (IW-1);
iy = ((iy + 1) / 2) * (IH-1);
as mentioned in https://github.com/pytorch/pytorch/blob/f064c5aa33483061a48994608d890b968ae53fb5/aten/src/THNN/generic/SpatialGridSamplerBilinear.c
How did you compute the grid? It must be based on some transform. Often, the affine_grid function is used. And this function takes the transformation matrix as input.
Given this transformation matrix (and its inverse), you can go in both directions: from input image pixel location to output image pixel location, and the other way round.
Here is some sample code showing how to compute the transforms for both the forward and the backward direction. The last two lines show how to map a pixel location in both directions.
import torch
import torch.nn.functional as F
# given a transform mapping from output to input, create the sample grid
input_tensor = torch.zeros([1, 1, 2, 2]) # batch x channels x height x width
transform = torch.tensor([[[0.5, 0, 0], [0, 1, 3]]]).float()
grid = F.affine_grid(transform, input_tensor.size(), align_corners=True)
# show the grid
print('GRID')
print('y', grid[0, ..., 0])
print('x', grid[0, ..., 1])
# compute both transformation matrices (forward and backward) with shape 3x3
print('TRANSFORM AND INVERSE')
transform_full = torch.zeros([1, 3, 3])
transform_full[0, 2, 2] = 1
transform_full[0, :2, :3] = transform
transform_inv_full = torch.inverse(transform_full)
print(transform_full)
print(transform_inv_full)
# map pixel location x=2, y=3 in both directions (forward and backward)
print('TRANSFORMED PIXEL LOCATIONS')
print(transform_full @ torch.tensor([[2, 3, 1]]).float().T)
print(transform_inv_full @ torch.tensor([[2, 3, 1]]).float().T)
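As a follow-up to the formula quoted in the question: the grid returned by affine_grid holds normalized coordinates in [-1, 1], so to express a sampling location in input-image pixels you can apply exactly that unnormalization. A minimal sketch (my own addition, reusing the grid from the code above and assuming the align_corners=True convention):
# Map the normalized grid values from [-1, 1] back to pixel positions in the input,
# using the unnormalization quoted in the question (align_corners=True case).
IH, IW = input_tensor.shape[2], input_tensor.shape[3]
pix_w = ((grid[..., 0] + 1) / 2) * (IW - 1)   # sampled positions along the width axis
pix_h = ((grid[..., 1] + 1) / 2) * (IH - 1)   # sampled positions along the height axis
print('sampled input positions (width axis):', pix_w)
print('sampled input positions (height axis):', pix_h)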

Find vertical 1px lines with OpenCV

I have an image something like the image below (on the left):
I want to extract only the pixels in red on the right: the pixels that belong to a 1px vertical line, but not to any thicker line or other region with more than 1 adjacent black pixel. The image is bitonal.
So far I have tried a morphological OPEN with a vertical kernel (10 px, which is fine for my purposes) and a horizontal kernel, and taken the difference, but this needs an awkward shift and leaves some "speckles":
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 10))
vertical_mask1 = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel,
                                  iterations=1)
horz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1))
horz_mask = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horz_kernel,
                             iterations=1)
M = np.float32([[1, 0, -1], [0, 1, 1]])
rows, cols = horz_mask.shape
vertical_mask = cv2.warpAffine(horz_mask, M, (cols, rows))
result = cv2.bitwise_and(thresh, cv2.bitwise_not(horz_mask))
What is the correct way to isolate the 1px lines (and only the 1px lines)?
In the general case, for other kernels, this question is: how do I find all pixels in the image that are in regions that the kernel "fits inside" (and then a subtraction to get my desired result)?
That's basically (binary) template matching. You need to derive proper templates from your "kernels". For larger "kernels", that might involve using masks for these templates, too, cf. cv2.matchTemplate.
What's the most important feature for a single pixel vertical line? The left and right neighbour of the current pixel must be 0. So, the template to match is [0, 1, 0]. By using the TemplateMatchMode cv2.TM_SQDIFF_NORMED, perfect matches will lead to close to 0 values in the result array.
You can mask those locations, and dilate according to the size of your template. Then, you use bitwise_and to extract the actual pixels that belong to your template.
Here's some code with a few templates ("kernels"):
import cv2
import numpy as np

img = cv2.imread('AapJk.png', cv2.IMREAD_GRAYSCALE)[:, :50]

vert_line = np.array([[0, 1, 0]], np.uint8)
cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], np.uint8)
corner = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 1]], np.uint8)

for i_k, k in enumerate([vert_line, cross, corner]):
    m, n = k.shape
    img_tmp = 1 - img // 255
    mask = cv2.matchTemplate(img_tmp, k, cv2.TM_SQDIFF_NORMED) < 10e-6
    mask = cv2.dilate(mask.astype(np.uint8), np.ones((m, n)), anchor=(n-1, m-1))
    m, n = mask.shape
    mask = cv2.bitwise_and(img_tmp[:m, :n], mask)
    out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
    roi = out[:m, :n]
    roi[mask.astype(bool), :] = [0, 0, 255]
    cv2.imwrite('{}.png'.format(i_k), out)
Vertical line:
Cross:
Bottom right corner 3 x 3:
Larger templates ("kernels") most likely will require additional masks, depending on how many or which neighbouring pixels should be considered or not.
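As a hypothetical illustration of that last point (not part of the original answer): cv2.matchTemplate accepts a mask for the TM_SQDIFF and TM_CCORR_NORMED methods, so "don't care" pixels of a larger template can simply be zeroed out in the mask. A sketch, reusing the inverted image img_tmp from above:
# Hypothetical sketch: the 3 x 3 cross template from above, but with the corner
# pixels marked as "don't care" via a mask (zeros in the mask are ignored).
templ = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], np.uint8)
templ_mask = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]], np.uint8)

# TM_SQDIFF supports a mask; perfect (masked) matches give values near 0.
res = cv2.matchTemplate(img_tmp, templ, cv2.TM_SQDIFF, mask=templ_mask)
hits = res < 10e-6
print('masked template hits:', int(np.count_nonzero(hits)))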
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.19041-SP0
Python: 3.9.1
PyCharm: 2021.1.3
NumPy: 1.20.3
OpenCV: 4.5.2
----------------------------------------

Does every occurrence of an object need to be labelled when annotating images for YOLO, or for image recognition in general?

I need to prepare a dataset of stacked books. Say an image contains 5 books but I only create bounding boxes for 4 of them. Will that affect the performance of my model in any way?
It is hard to create bounding boxes for stacked books when they are placed at a weird angle, and I might have missed some of the books when drawing boxes because I got distracted by the number of lines. On top of that, the bounding boxes are axis-aligned, so when the books are slanted, the box for one book can cover several books. Is this bad practice? Is it okay if I just leave some of the books unboxed?
Lastly, if you train the model on individual books only (not stacked), will the books still be detected once they are stacked up and more than half of each book is covered by other books?
Although the bounding boxes might overlap, if it is only books you want to detect, you should annotate and display all books, as this helps increase the reliability of your dataset, especially if you are using a small number of images. However, you can always try to color-splash the relevant pixels of your image, which is done in the Mask RCNN repo.
Below is the function that Mask RCNN uses in the visualization file.
def display_instances(image, boxes, masks, class_ids, class_names,
                      scores=None, title="",
                      figsize=(16, 16), ax=None,
                      show_mask=True, show_bbox=True,
                      colors=None, captions=None):
    """
    boxes: [num_instance, (y1, x1, y2, x2, class_id)] in image coordinates.
    masks: [height, width, num_instances]
    class_ids: [num_instances]
    class_names: list of class names of the dataset
    scores: (optional) confidence scores for each box
    title: (optional) Figure title
    show_mask, show_bbox: To show masks and bounding boxes or not
    figsize: (optional) the size of the image
    colors: (optional) An array or colors to use with each object
    captions: (optional) A list of strings to use as captions for each object
    """
    # Number of instances
    N = boxes.shape[0]
    if not N:
        print("\n*** No instances to display *** \n")
    else:
        assert boxes.shape[0] == masks.shape[-1] == class_ids.shape[0]

    # If no axis is passed, create one and automatically call show()
    auto_show = False
    if not ax:
        _, ax = plt.subplots(1, figsize=figsize)
        auto_show = True

    # Generate random colors
    colors = colors or random_colors(N)

    # Show area outside image boundaries.
    height, width = image.shape[:2]
    ax.set_ylim(height + 10, -10)
    ax.set_xlim(-10, width + 10)
    ax.axis('off')
    ax.set_title(title)

    masked_image = image.astype(np.uint32).copy()
    for i in range(N):
        color = colors[i]

        # Bounding box
        if not np.any(boxes[i]):
            # Skip this instance. Has no bbox. Likely lost in image cropping.
            continue
        y1, x1, y2, x2 = boxes[i]
        if show_bbox:
            p = patches.Rectangle((x1, y1), x2 - x1, y2 - y1, linewidth=2,
                                  alpha=0.7, linestyle="dashed",
                                  edgecolor=color, facecolor='none')
            ax.add_patch(p)

        # Label
        if not captions:
            class_id = class_ids[i]
            score = scores[i] if scores is not None else None
            label = class_names[class_id]
            caption = "{} {:.3f}".format(label, score) if score else label
        else:
            caption = captions[i]
        ax.text(x1, y1 + 8, caption,
                color='w', size=11, backgroundcolor="none")

        # Mask
        mask = masks[:, :, i]
        if show_mask:
            masked_image = apply_mask(masked_image, mask, color)

        # Mask Polygon
        # Pad to ensure proper polygons for masks that touch image edges.
        padded_mask = np.zeros(
            (mask.shape[0] + 2, mask.shape[1] + 2), dtype=np.uint8)
        padded_mask[1:-1, 1:-1] = mask
        contours = find_contours(padded_mask, 0.5)
        for verts in contours:
            # Subtract the padding and flip (y, x) to (x, y)
            verts = np.fliplr(verts) - 1
            p = Polygon(verts, facecolor="none", edgecolor=color)
            ax.add_patch(p)

    ax.imshow(masked_image.astype(np.uint8))
    if auto_show:
        plt.show()
I don't know what type of network you are using, which could also affect how well you detect a book that is turned or flat if you only trained on the top part of a book.

Inverse Perspective Transform?

I am trying to find the bird's-eye image from a given image. I also have the rotations and translations (and the intrinsic matrix) required to convert it into the bird's-eye plane. My aim is to find the inverse homography matrix (3x3).
import cv2
import numpy as np

rotation_x = np.asarray([[1, 0, 0, 0],
                         [0, np.cos(R_x), -np.sin(R_x), 0],
                         [0, np.sin(R_x),  np.cos(R_x), 0],
                         [0, 0, 0, 1]], np.float32)

translation = np.asarray([[1, 0, 0, 0],
                          [0, 1, 0, 0],
                          [0, 0, 1, -t_y / (dp_y * np.sin(R_x))],
                          [0, 0, 0, 1]], np.float32)

intrinsic = np.asarray([[s_x * f / dp_x, 0, 0, 0],
                        [0, 1 * f / dp_y, 0, 0],
                        [0, 0, 1, 0]], np.float32)

# The projection matrix to convert the image coordinates to the 3-D domain,
# from (x, y, 1) to (x, y, 0, 1); not sure if this is the right approach
projection = np.asarray([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 0],
                         [0, 0, 1]], np.float32)

homography_matrix = intrinsic @ translation @ rotation_x @ projection
inv = cv2.warpPerspective(source_image, homography_matrix, (w, h),
                          flags=cv2.INTER_CUBIC | cv2.WARP_INVERSE_MAP)
My question is: is this the right approach? I can manually set a suitable t_y, R_x by hand, but it does not work for the (t_y, R_x) that is actually provided.
First premise: your bird's eye view will be correct only for one specific plane in the image, since a homography can only map planes (including the plane at infinity, corresponding to a pure camera rotation).
Second premise: if you can identify a quadrangle in the first image that is the projection of a rectangle in the world, you can directly compute the homography that maps the quad into the rectangle (i.e. the "birds's eye view" of the quad), and warp the image with it, setting the scale so the image warps to a desired size. No need to use the camera intrinsics. Example: you have the image of a building with rectangular windows, and you know the width/height ratio of these windows in the world.
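As a hypothetical sketch of that second premise (my own addition; the corner coordinates and the aspect ratio are made up), cv2.getPerspectiveTransform gives the homography from the imaged quad to the desired rectangle directly:
# Made-up example: the four image corners of a window whose world
# width/height ratio is known to be 2:3, listed in a consistent order.
quad_in_image = np.float32([[410, 120], [720, 160], [700, 520], [390, 470]])
out_w, out_h = 200, 300                      # desired output size, same 2:3 ratio
rect_target = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])

H = cv2.getPerspectiveTransform(quad_in_image, rect_target)
birds_eye = cv2.warpPerspective(source_image, H, (out_w, out_h))  # source_image as in the question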
Sometimes you can't find rectangles, but your camera is calibrated, and thus the problem you describe comes into play. Let's do the math. Assume the plane you are observing in the given image is Z=0 in world coordinates. Let K be the 3x3 intrinsic camera matrix and [R, t] the 3x4 matrix representing the camera pose in XYZ world frame, so that if Pc and Pw represent the same 3D point respectively in camera and world coordinates, it is Pc = R*Pw + t = [R, t] * [Pw.T, 1].T, where .T means transposed. Then you can write the camera projection as:
s * p = K * [R, t] * [Pw.T, 1].T
where s is an arbitrary scale factor and p is the pixel that Pw projects onto. But if Pw=[X, Y, Z].T is on the Z=0 plane, the 3rd column of R only multiplies zeros, so we can ignore it. If we then denote with r1 and r2 the first two columns of R, we can rewrite the above equation as:
s * p = K * [r1, r2, t] * [X, Y, 1].T
But K * [r1, r2, t] is a 3x3 matrix that transforms points on a 3D plane to points on the camera plane, so it is a homography.
If the plane is not Z=0, you can repeat the same argument replacing [R, t] with [R, t] * inv([Rp, tp]), where [Rp, tp] is the coordinate transform that maps a frame on the plane, with the plane normal being the Z axis, to the world frame.
Finally, to obtain the bird's eye view, you select a rotation R whose third column (the components of the world's Z axis in camera frame) is opposite to the plane's normal.
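Here is a minimal sketch of that construction (my own addition, with made-up intrinsics and pose), assuming the observed plane is Z=0 in the world frame:
import cv2
import numpy as np

# Made-up numbers: K is the camera intrinsic matrix, and [R, t] the camera pose,
# so that Pc = R @ Pw + t for world points Pw on the Z = 0 plane.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, _ = cv2.Rodrigues(np.array([[0.6], [0.0], [0.0]]))   # some tilt about the X axis
t = np.array([[0.0], [0.0], [5.0]])

# Homography from the Z = 0 plane to the image: s * p = K * [r1, r2, t] * [X, Y, 1].T
H = K @ np.hstack((R[:, 0:1], R[:, 1:2], t))

# A point on the plane and the pixel it projects onto:
XY1 = np.array([1.0, 2.0, 1.0])
p = H @ XY1
print('pixel:', p[:2] / p[2])

# For the bird's-eye view you would warp the image with the inverse of H
# (plus a scale/offset of your choice to frame the output), e.g.:
# birds_eye = cv2.warpPerspective(img, np.linalg.inv(H), (w, h))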

Image Processing of Function Graph

I would like to detect these points in this graph, and I also want to detect the lines.
I have looked at edge detection and corner detection (such as the Harris corner detector), but I don't know how to handle this kind of graph. I only need a pseudo-algorithm, or the steps for approaching such a problem.
Detect the vertices: segment by color (r >> max(g, b)), then apply a median or minimum filter of the appropriate size, or simply binary-erode a few times. Then just label the remaining connected blobs.
Detect the lines: use a simplified Hough transform. Basically, draw a virtual line from the center of each vertex to every other vertex and count the red pixels along the line. If plenty of them are red, the line exists; otherwise the two vertices are not connected.
Something like this:
import numpy as np
from imageio import imread                     # scipy.misc.imread was removed from SciPy
from scipy.ndimage import filters, morphology, measurements
from skimage.draw import line

img = imread("laGK6.jpg")
r = img[:, :, 0]
g = img[:, :, 1]
b = img[:, :, 2]

# segment the red graph: red channel clearly larger than green and blue
mask = (r.astype(np.float64) - np.maximum(g, b)) > 20
mask2 = morphology.binary_erosion(mask)
mask2 = morphology.binary_erosion(mask2)
mask2 = morphology.binary_erosion(mask2)
mask2 = morphology.binary_erosion(mask2)
mask2 = morphology.binary_dilation(mask2)

# label the remaining blobs and take their centers of mass as the vertices
label, numfeatures = measurements.label(mask2)
mc = measurements.center_of_mass(mask2, label, range(1, numfeatures + 1))

mask3 = np.zeros_like(mask2)
for p in mc:
    mask3[int(p[0]), int(p[1])] = 255          # centers are floats, indices must be ints

# for every pair of vertices, walk the straight line between them and check
# which fraction of its pixels is red; if almost all are, the vertices are connected
connections = []
for i in range(numfeatures):
    for j in range(i + 1, numfeatures):
        rr, cc = line(int(mc[i][0]), int(mc[i][1]), int(mc[j][0]), int(mc[j][1]))
        mask3[rr, cc] = 255
        ms = np.sum(mask[rr, cc]).astype(np.float64) / len(rr)
        if ms > 0.9:
            connections.append((i, j))

print("vertices: ", mc)
print("connections: ", connections)
This outputs the following:
vertices: [(76.551724137931032, 288.72413793103448),
 (76.568181818181813, 613.61363636363637),
 (138.72727272727272, 126.04545454545455),
 (139.33333333333334, 450.33333333333331),
 (265.18181818181819, 207.5151515151515),
 (264.96666666666664, 369.53333333333336),
 (265.41379310344826, 694.51724137931035),
 (265.51724137931035, 45.379310344827587),
 (327.57692307692309, 532.42307692307691)]
connections: [(0, 4), (0, 5), (1, 6), (1, 8), (2, 4), (2, 7), (3, 5), (3, 8)]
I am also working on a project that detects shapes in a drawing. I am not sure whether it will solve your problem as well, but here is what I have done for such problems.
I am assuming that you need the X and Y coordinates of those edge points.
First, you need the X and Y values of the complete shape.
Next, inside a loop, put an if condition saying "take this point if Y[i] < Y[i+1] and Y[i] < Y[i-1]", i.e. a point whose next and previous points both have a larger Y value than the current one (see the sketch below).
This condition gives you the X and Y values of the edge points.
Good luck.
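A minimal sketch of that local-minimum test (my own addition; the sampled Y values are made up):
import numpy as np

# Made-up sampled curve: Y values of the traced shape, ordered along X.
Y = np.array([5.0, 4.2, 3.1, 2.8, 3.5, 4.9, 4.1, 3.0, 3.6, 5.2])
X = np.arange(len(Y))

# A point is kept if both its neighbours have a larger Y value
# (the condition described above).
is_corner = (Y[1:-1] < Y[2:]) & (Y[1:-1] < Y[:-2])
corner_idx = np.where(is_corner)[0] + 1

print(list(zip(X[corner_idx], Y[corner_idx])))   # -> [(3, 2.8), (7, 3.0)]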
If the graph is always the same color, and the vertices are always marked with squares, you can threshold the image by its color to detect the lines and vertices. Then look for connected sets of pixels whose width and height match the size of the squares, which you can simply measure (see the sketch below).
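A hypothetical sketch of that approach (the file name, the color range, and the expected square size are assumptions):
import cv2
import numpy as np

# Threshold by the graph's color (here: red-ish pixels in BGR, a made-up range).
img = cv2.imread('graph.png')
mask = cv2.inRange(img, (0, 0, 150), (80, 80, 255))

# Find connected blobs and keep those whose bounding box matches the measured
# size of the vertex squares (here assumed to be roughly 9 x 9 pixels).
num, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
vertices = [tuple(centroids[i])
            for i in range(1, num)                       # label 0 is the background
            if 7 <= stats[i, cv2.CC_STAT_WIDTH] <= 11
            and 7 <= stats[i, cv2.CC_STAT_HEIGHT] <= 11]
print(vertices)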
