Matching bounding boxes from multiple viewing angles - opencv

I have mounted two cameras on different shelves of a fridge (at the bottom and top), facing each other. An object detector is fed with two images from these two streams and returns bounding boxes from them, independently.
The problem: Given bounding boxes from different viewing angles, determine their correspondence.
I know that since depth is unknown, the x, y coordinates in one camera may correspond to multiple positions in the other. This is why our solutions so far have been approximations that work with varying success. The naive solution has been to ignore any difference between the camera coordinates and use the Euclidean distance between box positions to get the correspondence.
Another solution is to use the Fundamental Matrix, which gives a way to calculate correspondence based on epipolar geometry. This solution may be quite tedious because it requires a form of calibration and the results have not been great. It may be due to the fact that my calibration was sloppy.
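As a rough illustration of the epipolar idea, the sketch below scores every pair of boxes by the pixel distance from a box center in camera B to the epipolar line of a box center from camera A, then picks the assignment with the lowest total cost. It assumes a fundamental matrix F that satisfies x_b^T F x_a = 0, uses box centers as a stand-in for the object position (an approximation, since the two cameras see different sides of the object), and brute-forces the assignment, which is only reasonable for a handful of boxes. The names epipolar_cost and match_boxes are hypothetical.

```python
import itertools
import numpy as np

def epipolar_cost(F, pts_a, pts_b):
    """cost[i, j] = pixel distance from point j (camera B) to the
    epipolar line of point i (camera A), assuming x_b^T F x_a = 0."""
    pts_a = np.asarray(pts_a, dtype=float)
    pts_b = np.asarray(pts_b, dtype=float)
    a_h = np.hstack([pts_a, np.ones((len(pts_a), 1))])
    b_h = np.hstack([pts_b, np.ones((len(pts_b), 1))])
    lines = a_h @ F.T                      # row i is the line (a, b, c) = F @ a_i
    num = np.abs(lines @ b_h.T)            # |a*x + b*y + c| for every pair
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return num / den

def match_boxes(cost):
    """Brute-force minimum-cost assignment; assumes rows <= columns."""
    n_a, n_b = cost.shape
    best = min(
        itertools.permutations(range(n_b), n_a),
        key=lambda perm: sum(cost[i, j] for i, j in enumerate(perm)),
    )
    return list(enumerate(best))

# Toy example: F of a rectified pair, where matching points share the same y.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
centers_a = [(10, 50), (20, 100)]
centers_b = [(300, 100), (250, 50)]
matches = match_boxes(epipolar_cost(F, centers_a, centers_b))
```

With many boxes, an assignment solver such as the Hungarian algorithm would be a better fit than enumerating permutations.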
The last solution, which works poorly, is to use edge and keypoint detection and match the results. Since the edges, shapes etc. differ quite radically between the two views, we understand why this is so.
Ultimately, I was wondering how you would have tackled the problem, or if there is anything you can point me towards to get a more robust solution.
In the left image we have the bottom view. This image has been mirrored about the y-axis. The right image is from the top shelf. The object detector has detected two bounding boxes for each viewing angle, a total of four bounding boxes. These are drawn on both cameras. For example, the bounding box with id 16 is detected on the top-positioned camera and is also drawn on the bottom camera to indicate its displacement. How would you go about matching the bounding boxes which belong to the same object viewed from another viewing angle?

Related

How to normalize position of the elements on the picture [OpenCV]

I am currently working on a program which could help at my work. I'm trying to use Machine Learning for classification. The problem is that I don't have enough samples for training the model, and augmentation is something I'm trying to avoid because of hardware limitations (not enough RAM) both on my company laptop and on Google Colab. So I decided to try to normalize the position of the elements so that the differences would be visible to the model even without a large number of different samples. Unfortunately, I'm now struggling with how to normalize those pictures.
Element 1a:
Element 1b:
Element 2a:
Element 2b:
Elements 1a and 1b are the same type, and 2a and 2b are the same type. Is there a way to normalize the position in those pictures (something like a position 0) which would help the algorithm see the differences between them? I've tried using cv2.minAreaRect to get the bounding rectangle, rotating the elements and cropping the unneeded area, but unfortunately those elements can have different widths, so after scaling them down the contours are deformed unevenly. Then I tried finding the symmetry axis and using it to crop properly after rotation, but the results still didn't meet my expectations. I was thinking of adding more normalization points like this:
Normalization Points:
and using these points to normalize the position of the rest of my elements, but a perspective transform takes only 4 points, and with 4 points it's also not a very good method. Maybe you know a way to move those elements so that they end up in the same positions.
Seeing the images, I believe that the transformation between two pictures is either an isometry (translation + rotation) or a similarity (translation + rotation + scaling). These can be determined with just two points. (Perspective takes four points but I think that this is overkill.)
But for good accuracy, you must make sure that the points are found reliably and precisely. In the first place, you need to guess which features of the shapes are repeatable from one sample to the next.
For example, you might estimate that the straight edges are always in the same relative position. In such a case, I would recommend finding two points on each of several edges, drawing a line through each pair, and finding the intersections between the lines.
In the illustration, you find edge points along the red profiles, and from them you draw the green lines. They intersect in the yellow points.
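A minimal sketch of the intersection step, assuming each green line is given by two edge points in pixel coordinates. It uses homogeneous coordinates, where the line through two points and the intersection of two lines are both cross products. The function name line_intersection is hypothetical.

```python
import numpy as np

def line_intersection(p1, p2, p3, p4):
    """Intersection of the line through p1, p2 with the line through p3, p4.
    Returns None when the lines are (numerically) parallel."""
    to_h = lambda p: np.array([p[0], p[1], 1.0])
    line_a = np.cross(to_h(p1), to_h(p2))   # line through p1 and p2
    line_b = np.cross(to_h(p3), to_h(p4))   # line through p3 and p4
    x, y, w = np.cross(line_a, line_b)      # homogeneous intersection point
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)

corner = line_intersection((0, 0), (2, 2), (0, 2), (2, 0))
```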
For increased accuracy, you can use a least-squares approach to find a best fit on more than two points.
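The least-squares fit can be sketched as follows, assuming you already have matched reference points (for example the yellow intersections) in a source picture and a target picture. A similarity transform has four unknowns (a, b, tx, ty) with x' = a*x - b*y + tx and y' = b*x + a*y + ty, so two point pairs determine it and any extra pairs are averaged in by the solver. The name fit_similarity is hypothetical.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity transform (rotation + uniform scale +
    translation) mapping src points onto dst points.
    Returns a 2x3 matrix [[a, -b, tx], [b, a, ty]]."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    A = np.zeros((2 * n, 4))
    # Rows for x': a*x - b*y + tx
    A[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    # Rows for y': b*x + a*y + ty
    A[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    rhs = dst.reshape(-1)                   # interleaved x0', y0', x1', y1', ...
    a, b, tx, ty = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

# Toy data: rotate by 90 degrees, scale by 2, translate by (3, 4).
src_pts = [(0, 0), (1, 0), (0, 1)]
dst_pts = [(3, 4), (3, 6), (1, 4)]
M = fit_similarity(src_pts, dst_pts)
```

The resulting 2x3 matrix has the layout that affine warping functions typically expect, so it could be used to bring every picture into the same reference position.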

How to find hinge point or axis of rotation point from top view using image processing?

I have a problem at hand where I need to detect/predict the coordinates of the hinge point or axis of rotation point using image processing. The image is as shown below:
I've used a method where I started by tracking the circular movement (in an arc) of a few feature points in an RoI around the default hinge coordinates (entered manually) in a configuration file. This circular motion of the tracked points happens around the vertical axis which passes through the hinge point. I tracked these points from their initial position until the connecting bar made a particular angle (15°/20°) with the y-axis. I then drew secants between the start and end positions of each point and constructed their perpendicular bisectors, which should ideally pass through the centre of the (concentric) circles, which is the ideal hinge point.
Eg:
y_intercepts calculated for each point
H0 (322, 42)
H1 (322, 64) (within tolerance, closest to GT)
H2 (322, 48)
H_avg (322,52)
H_groundtruth (x,y): (322, 61)
We need an accuracy or tolerance of +/- 3 pixels.
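For reference, the perpendicular-bisector construction described above can be written as one small least-squares problem instead of intersecting each bisector with a fixed vertical line. This is only a sketch under the planar assumption of the question: each track contributes the equation chord · (c - midpoint) = 0 for the unknown centre c, and all tracks are solved together. The name hinge_from_tracks and the synthetic data are hypothetical.

```python
import numpy as np

def hinge_from_tracks(starts, ends):
    """Least-squares intersection of the perpendicular bisectors of the
    chords start -> end. Needs at least two non-parallel chords."""
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    chords = ends - starts                  # each bisector is normal to its chord
    mids = (starts + ends) / 2.0            # and passes through the chord midpoint
    rhs = np.sum(chords * mids, axis=1)     # chord . c = chord . midpoint
    center, *_ = np.linalg.lstsq(chords, rhs, rcond=None)
    return center

# Synthetic check: two points rotating by 20 degrees around (322, 61).
true_center = np.array([322.0, 61.0])
def on_circle(radius, angle_deg):
    a = np.radians(angle_deg)
    return true_center + radius * np.array([np.cos(a), np.sin(a)])

starts = [on_circle(100, 0), on_circle(50, 90)]
ends = [on_circle(100, 20), on_circle(50, 110)]
estimate = hinge_from_tracks(starts, ends)
```

When all tracked points start at similar angles and sweep the same small arc, their bisectors are nearly parallel, so the estimate along the bisector direction is poorly constrained. That may be one reason the candidates spread out along the vertical line.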
The issue we face when moving from this ideal scenario to practice is:
Different tracked points give different potential hinge points (different dots on the vertical yellow line), a few of which are very close to the ground truth (yellow circle), but their weighted average (big green circle) is off the mark. Quite frankly, this is a problem of too many candidates: some of them are very close to the ground truth, but we can't tell which, as we’re not supposed to use the default hinge coordinates (entered manually) from the config file.
One solution could be to use frameworks already implemented for image registration such as elastix. If you configure it for a rigid registration, you can get the transformation matrix and therefore the center of the rotation.
The problem here is that only one part of your image is moving. Before doing the registration, I would simply mask the region of interest by calculating a mask from the subtraction of the two images, to keep only the part where something actually moved.
Such an approach could achieve subpixel accuracy. You could also repeat it for multiple angles and average the results. As an alternative to averaging, you could use the RANSAC algorithm to identify which hinge points are off (outliers) and exclude them.
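To illustrate the outlier-rejection idea, here is a simplified consensus step in the spirit of RANSAC. It is not full RANSAC: because there are only a few candidates, every candidate is tried as the hypothesis instead of sampling at random. It assumes the candidates are scalar hinge coordinates (for example the y values along the vertical line) and that the tolerance is known. The name consensus_estimate and the sample values are hypothetical.

```python
import numpy as np

def consensus_estimate(candidates, tol=3.0):
    """Pick the largest group of candidates lying within tol of one of
    them and return (mean of that group, the group itself).
    Expects at least one candidate."""
    values = np.asarray(candidates, dtype=float)
    best_inliers = values[:0]               # empty array to start with
    for v in values:
        inliers = values[np.abs(values - v) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers.mean(), best_inliers

# Synthetic candidates: three agree, two are far off.
estimate_y, inliers = consensus_estimate([60, 62, 61, 42, 80], tol=3.0)
```

This only helps when a majority of the candidates cluster around the true value, which needs enough tracks or repeated measurements at several angles.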
Here is an example how to do a simple rigid transformation with elastix.
I hope this helps!
I intended this as only a comment, but it ended up significantly over the character limit:
The problem from an accuracy perspective (sorry, couldn't resist) seems to be that you're trying to use a planar euclidean geometry technique to solve a projective geometry problem.
Those feature tracks are only circular arcs in 3D world space. They're actually (noisy) elliptical arcs in 2D image pixel space due to the projection.
Your hinge rotation axis isn't a single pixel either, unless your camera's optical axis is directly aligned with the hinge axis. If that's not the case (as the perspective in the photo you added suggests), then your hinge axis is actually a line in pixel space, not a point, and different heights for the different tracks in model space will be 'centered' around different pixels on that line. So asking for +/- 3 pixel hinge 'point' accuracy is unclear, and so is measuring angles in pixel space in general in a way that doesn't account for perspective.
I only mention these details because you seem focused on measuring accurately. Often, those kinds of 2D approximations are fine for many applications, but high accuracy and precision from a single camera (if that's really what you need) requires better 3D scene understanding. (Or you could train a deep network with a bunch of labeled ground truth images and let it figure out the mappings.)
Now maybe you don't need such high accuracy for your application after all. In that case, simple affine geometry techniques like that mentioned in the other answer might work well enough.

Algorithm for selecting outer points on a graph ("rich" convex hull)

I'm looking for an efficient way of selecting a relatively large portion of points (2D Euclidean graph) that are the furthest away from the center. This resembles the convex hull, but would include (many) more points. Further criteria:
The number of points in the selection / set ("K") must be within a specified range. Most likely it won't be very narrow, but it must work for different ranges (e.g. 0.01*N < K < 0.05*N as well as 0.1*N < K < 0.2*N).
The algorithm must be able to balance distance from the center and "local density". If there are dense areas near the upper part of the graph range, but sparse areas near the lower part, then the algorithm must make sure to select some points from the lower part even if they are closer to the center than the points in the upper region. (See example below)
Bonus: rather than simple distance from center, taking into account distance to a specific point (or both a point and the center) would be perfect.
My attempts so far have focused on using "pigeon holing" (divide graph into CxR boxes, assign points to boxes based on coordinates) and selecting "outer" boxes until we have sufficient points in the set. However, I haven't been successful at balancing the selection (dense regions over-selected because of fixed box size) nor at using a selected point as reference instead of (only) the center.
I've (poorly) drawn an example: the red dots are the points, and the green shape is an example of what I want (outside the green = selected). For sparse regions, the bounding shape comes closer to the center to find suitable points (but doesn't necessarily find any, if they're too close to the center). The yellow box is an example of what my pigeon-holing based algorithm does. Even when trying to adjust for sparser regions, it doesn't manage well.
Any and all ideas are welcome!
I don't think there are any standard algorithms that will give you what you want. You're going to have to get creative. Assuming your points are embedded in 2D Euclidean space here are some ideas:
Iteratively compute several convex hulls. For example, compute the convex hull, keep the points that are part of the convex hull, then compute another convex hull ignoring the points from the original convex hull. Continue to do this until you have a sufficient number of points, essentially plucking off points on the perimeter for each iteration. The only problem with this approach is that it will not work well for concavities in your data set (e.g., the one on the bottom of your sample you posted).
Fit a Gaussian to your data and keep everything > N standard deviations away from the mean (where N is a value that you'd have to choose). This should work pretty well if your data is Gaussian. If it isn't, you could always model it with several Gaussians (instead of one), and keep points with a joint probability less than some threshold. Using multiple Gaussians will probably handle concavities decently. References:
http://en.wikipedia.org/wiki/Gaussian_function
How to fit a gaussian to data in matlab/octave?
Use Kernel Density Estimation - If you create a kernel density surface, you could slice the surface at some height (e.g., turning it into a plateau), giving you a perimeter shape (the shape of the plateau) around the points. The trick is to slice it at the right height, because you could end up with no points outside of the shape; with the right choice (which may be difficult to make) you could get the green shape you drew in your example. The big drawback of this approach is that it is very computationally expensive. More information:
http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
Use alpha shapes to get a general shape that wraps tightly around the outside perimeter of the point set. Then erode the shape a little to force some points outside of the shape. I don't have a lot of experience with alpha shapes, but this approach will also be quite computationally expensive. More info:
http://doc.cgal.org/latest/Alpha_shapes_2/index.html
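The first idea in the list above (repeatedly peeling off convex hulls) can be sketched in a few lines. This is a minimal pure-Python version that assumes the points are (x, y) tuples and uses Andrew's monotone chain for the hull. The names convex_hull and peel_outer_points are hypothetical, and the loop only enforces a lower bound on K, so the last layer may overshoot the upper bound.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counterclockwise.
    Points lying on a hull edge (collinear) are not included."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peel_outer_points(points, k_min):
    """Peel hull layers until at least k_min points are selected."""
    remaining = set(points)
    selected = []
    while remaining and len(selected) < k_min:
        layer = convex_hull(list(remaining))
        selected.extend(layer)
        remaining -= set(layer)
    return selected

outer = [(0, 0), (4, 0), (4, 4), (0, 4)]
inner = [(1, 1), (3, 1), (3, 3), (1, 3)]
pts = outer + inner + [(2, 2)]
first_layer = peel_outer_points(pts, k_min=4)
two_layers = peel_outer_points(pts, k_min=5)
```

As noted in the list, this does not follow concavities, and it does not handle the bonus requirement of measuring distance to a chosen reference point.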

Get rectangle out of array of points

Using GPUImage, I am able to detect corners of a book/page in an image. But sometimes, it will pass more than 4 points, in which case I will need to process and figure out the best rectangle out of these points. Here's an example:
What's the most efficient way to figure out the best rectangle in this case?
Thanks
If you're using a corner detection algorithm, then you can filter results based on the relative strength of the detected corner. The contrast at the book corners relative to your current background appears to be much stronger than the contrast at the point found in the wood grain. Are there relative magnitudes associated with each point, or do you just get the points? Setting thresholds for edge strengths can mean a lot of fiddling unless the intensities of the foreground and background are relatively constant.
Your sample image could be blurred or morphed. For example, the right morphological "close" on light pixels could eliminate the texture in the wood grain without having an effect on the size and shape of the book. (http://en.wikipedia.org/wiki/Mathematical_morphology)
Another possibility is to shrink the image to a much smaller size and then perform detection on that. Resizing the image will tend to wipe out tiny details such as whatever wood grain pattern is currently being detected.
Picking the right lens and lighting can make the image easier to process. Try to simplify the image as much as possible before processing it. As mentioned above, "dark field" lighting that would illuminate just the book edges would present a much simpler image for processing. Writing down the constraints can make it more obvious which solution will be most robust and simplest to implement. Finding any rectangle anywhere in an image is very difficult; it's much easier to find a light rectangle on a dark background if the rectangle is at least 100 x 100 pixels in size, rotated no more than 15 degrees from square to the image edges, etc.
More involved solutions can be split into two approaches:
Solving the problem given only 4 or more (x, y) points.
Using a different image processing technique altogether for the sample image.
1. Solving the problem given only the points
If you generally only have 5 or 6 points, and if you are confident that 4 of those points will belong to the corners of the rectangles that you want, then you can try this:
1. Find the convex hull of all points. The convex hull is the N-gon that completely encompasses all points. If the points were pegs sticking up, and if you stretched a rubber band around them and let it snap into place, then the final shape of the rubber band is a convex hull. Algorithms that find convex hulls typically return a list of points that are ordered counterclockwise from the bottom leftmost point.
2. Make a copy of your point list and remove points from the copy until only four points remain. These four remaining points will still be ordered counterclockwise.
3. Calculate the angle formed by each set of three successive points: points 1, 2, 3, then 2, 3, 4, then 3, 4, 1, and so on.
4. If an angle is outside a reasonable tolerance--less than 70 degrees or greater than 110 degrees--skip back to step 2 and remove the next point (or set of points).
5. Store the min and max angles for each set of 4 points.
6. Repeat steps 2 - 6, removing a different point (or points) each time.
7. Track the set of points for which the min and max angles are closest to 90 degrees.
http://en.wikipedia.org/wiki/Convex_hull
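The steps above can be sketched as an exhaustive search over all 4-point subsets of the hull, which is cheap when only 5 or 6 points are involved. The sketch assumes the hull points are distinct and already ordered counterclockwise (so every subset keeps that order), and it scores a candidate by its worst deviation from 90 degrees. The names interior_angle and best_rectangle are hypothetical.

```python
import itertools
import math

def interior_angle(prev_pt, pt, next_pt):
    """Angle in degrees at pt between the segments to prev_pt and next_pt."""
    v1 = (prev_pt[0] - pt[0], prev_pt[1] - pt[1])
    v2 = (next_pt[0] - pt[0], next_pt[1] - pt[1])
    cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

def best_rectangle(hull_pts, lo=70.0, hi=110.0):
    """Return the 4 hull points whose angles are closest to 90 degrees,
    or None if no subset keeps all angles within [lo, hi]."""
    best, best_score = None, float("inf")
    for quad in itertools.combinations(hull_pts, 4):   # preserves hull order
        angles = [interior_angle(quad[i - 1], quad[i], quad[(i + 1) % 4])
                  for i in range(4)]
        if min(angles) < lo or max(angles) > hi:
            continue                                   # not rectangle-like
        score = max(abs(a - 90.0) for a in angles)
        if score < best_score:
            best, best_score = quad, score
    return best

# Four true corners plus one spurious point on the right-hand side.
hull = [(0, 0), (10, 0), (12, 3), (10, 8), (0, 8)]
rect = best_rectangle(hull)
```

Under perspective the page corners will not be exactly 90 degrees, which is why the tolerance band from the steps above is kept as a parameter.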
There are a number of other checks and constraints that could be introduced. For example, if the point-to-point distances for 3 successive points in the convex hull (pts N to N+1, and N+1 to N+2) are close to the expected width and height of the book, then you might mark these as known good points and only test the remaining points to see which is the fourth point.
The technique above can get unwieldy if you get quite a few points, but it may work if two or three of the book corner points are expected to be found on the convex hull.
For any geometric problem, I always recommend checking out GeometricTools.com, which has a lot of great, optimized source code for all sorts of problems. It's very handy to have the book as well, especially if you can find a cheap copy using AddAll.com.
http://www.geometrictools.com/
2. Other image processing techniques for your sample image
Although I could be wrong, it appears that GPUImage doesn't have many general-purpose image processing algorithms. Some other image processing algorithms could make this problem much simpler to solve.
Though there isn't space to go into it here, one of the keys to successful image processing is appropriate lighting. Make sure your lighting is consistent. A diffuse light that evenly illuminates the book and the background would work well. You can simplify the problem using funkier lighting: if you have four lights (or a special ring light), you can provide horizontal illumination from the top, bottom, left, and right that will cause the edges of the book to appear bright and other surfaces to appear dark.
http://www.benderassoc.com/mic/lighting/nerlite/Darkfield.htm
If you can use some other GPU libraries to do image processing, then one of the following techniques could work nicely:
Connected component labeling (a.k.a. finding blobs). It shouldn't be too hard to use either binary thresholding or a watershed algorithm to separate the white blob that is the book from the rest of the background. Once the blob for the book is identified, finding the corners is easier. (http://en.wikipedia.org/wiki/Connected-component_labeling) In OpenCV you can find the "contours."
Generate a list of edge points, then have four separate line-fitting tools search from top to bottom, right to left, bottom to top, and left to right to find the four strong (and mostly straight) edges associated with the book. In your sample image, though, either the book cover is slightly warped or the camera lens has introduced barrel distortion.
Use a corner detector designed to find light corners on a dark background. If you will always be looking for a white book on a wood grain background, you can create a detector to find white corners on a brown background.
Use a Hough technique to find the four strongest lines in the image. (http://en.wikipedia.org/wiki/Hough_transform)
The algorithmic technique that works best will depend on your constraints: are you looking for rectangles only of a certain size? is the contrast between foreground and background consistent? can you introduce lighting to simplify the appearance of the image? and so on.

Correlating a vector with edges in an image

I'm trying to implement user-assisted edge detection using OpenCV.
Assume you have an image in which we need to find a polygonal shape. For the sake of discussion, let's say we need to find the top of a rectangular table in a picture. The user will click on the four corners of the table to help us narrow things down. Connecting those four points gives us a polygon, or four vectors.
But the user is not very accurate when clicking on those corners. So I'd like to use edge information from the image to increase the accuracy.
I'm using a Canny edge detector with a fairly high threshold to determine the important edges in my image (more precisely, I'm scaling down, blurring, converting to grayscale, then running Canny). How can I compute whether a vector aligns with an edge in my image? If I have a way to compute "alignment", my overall algorithm comes down to perturbing the locations of the four corner points and computing the total "alignment" of my polygon with the edges in the image, until I find an optimum.
What is a good way to define and compute this "alignment" metric?
You may want to try to use FindContours to detect your table or any other contour. Then build a contour also from the user input points. After this you can read about Contour Moments by which you can compare contours. You can compare all the contours from the image with the one built from the user points and then select the closest match.
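Separately from the contour-matching suggestion above, one simple way to make the "alignment" from the question concrete is to sample points along each polygon side and measure the fraction that land on edge pixels. The sketch below is only one possible definition, not something taken from the answer. It assumes a binary edge map (nonzero = edge, such as Canny output) indexed as edge_map[y, x], and the names segment_alignment and polygon_alignment are hypothetical.

```python
import numpy as np

def segment_alignment(edge_map, p0, p1, n_samples=50):
    """Fraction of sample points along p0 -> p1 that fall on an edge pixel.
    Points are (x, y); samples outside the image count as misses."""
    h, w = edge_map.shape
    t = np.linspace(0.0, 1.0, n_samples)
    xs = np.rint(p0[0] + t * (p1[0] - p0[0])).astype(int)
    ys = np.rint(p0[1] + t * (p1[1] - p0[1])).astype(int)
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    hits = edge_map[ys[inside], xs[inside]] > 0
    return hits.sum() / n_samples

def polygon_alignment(edge_map, corners):
    """Mean alignment over all sides of the closed polygon."""
    n = len(corners)
    return sum(segment_alignment(edge_map, corners[i], corners[(i + 1) % n])
               for i in range(n)) / n

# Toy edge map with one horizontal edge on row 5.
edges = np.zeros((20, 20), dtype=np.uint8)
edges[5, 2:16] = 255
on_edge = segment_alignment(edges, (2, 5), (15, 5))
off_edge = segment_alignment(edges, (2, 8), (15, 8))
```

A hit-or-miss score like this is flat almost everywhere, so a search that perturbs the corners may get little guidance from it. Dilating the edge map or scoring by distance to the nearest edge pixel might give a smoother objective.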

Resources