How to recognize a real square in an image - opencv

I am using OpenCV for square detection in an image. The squares.c example is a really great help, but my problem is that it recognizes pretty much everything that has 4 corners that are close to 90 degrees.
My goal, however, is to recognize only the real squares in an image from a video feed, i.e. objects whose 4 edges have the same length and whose 4 angles are 90 degrees. This sounds rather easy at first, but since an object might be tilted in the image, the angles can vary between something like 45 and 135 degrees and the edges can have different lengths. If I only check these attributes, I still end up recognizing rectangles that are not squares.
I've been thinking about a good solution to recognize only real squares for a few days now, but everything I come up with is still flawed. I wonder if any of you knows the exact relation between the corner angles and the edge lengths. With my guessing so far I have come pretty far, but sometimes random "squares" pop up that I don't want to recognize. I really think there is some mathematical relation, but I can't find a formula for squares in a perspective view.
Any help would really be appreciated!

Without any reference coordinate system, how is this even possible? If you are doing the recognition on a video feed, can your system be "taught" what a square looks like by keeping a known square in the field of view at all times? Maybe then you can use it to figure out the rotations in 3-space, which you'd then have to apply to everything else in the feed.
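A minimal sketch of that idea, assuming you already have the four corners of the known reference square (ref_corners, a hypothetical variable, in the same winding order as the candidate quad) and that the candidate lies roughly in the same plane as the reference: rectify the plane with the perspective transform implied by the reference square, then test squareness in the rectified space.

```python
import cv2
import numpy as np

def is_square_after_rectification(candidate, ref_corners, side=100.0, tol=0.15):
    # Map the detected reference square onto an ideal axis-aligned square.
    ideal = np.float32([[0, 0], [side, 0], [side, side], [0, side]])
    H = cv2.getPerspectiveTransform(np.float32(ref_corners), ideal)
    # Apply the same rectification to the candidate quad.
    quad = cv2.perspectiveTransform(
        np.float32(candidate).reshape(-1, 1, 2), H).reshape(-1, 2)
    # All four sides should have (nearly) the same length.
    lengths = [np.linalg.norm(quad[i] - quad[(i + 1) % 4]) for i in range(4)]
    if max(lengths) / min(lengths) > 1.0 + tol:
        return False
    # All four corner angles should be close to 90 degrees.
    for i in range(4):
        a, b, c = quad[i - 1], quad[i], quad[(i + 1) % 4]
        v1, v2 = a - b, c - b
        cosine = abs(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        if cosine > 0.2:  # |cos| well above 0 means the angle is far from 90 degrees
            return False
    return True
```

If the candidate quads are not coplanar with the reference square, the rectification is only an approximation, so treat the thresholds as tuning knobs rather than hard rules.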

Related

What's the right way to handle the artifacts from TrueDepth camera buffer near the edges?

I am following the guide from https://developer.apple.com/documentation/avfoundation/cameras_and_media_capture/streaming_depth_data_from_the_truedepth_camera. It uses AVCaptureDepthDataOutput() to stream depth data from the front camera.
But two of the edges contain wrong data. In portrait mode it is the upper and right edge; in landscape rotated counter-clockwise, it is the left and upper side.
Here's an example of how it looks. This is from the tutorial, but I modified the code a little. Note that I printed out the actual distance as a number to make sure it wasn't a bug in the video/color rendering.
https://drive.google.com/file/d/1AGFHAZypmHz9136T02ufedz-X3Ct94Kq/view?usp=sharing
I'm okay with this if it's just a necessary artifact of translating from 3D to 2D data. But what I'm looking for is the "correct" way to crop the image so that the 2D depth buffer doesn't have these artifacts at the edges. How do I know how far in to crop? Do I just experiment with it? If so, how would I know it works the same across all devices?
Another issue is that if an object gets too close to the camera, the distance becomes interpreted as the maximum distance instead of minimum. Is this expected?
Turn off "Smooth depth" and the buffer will be much more clearly defined and consistent; the unreliable region near the edges shrinks to something like 50 pixels or less. It is safe to assume that any value near the edge is not accurate, so inset by a similar amount (around 50 pixels) and experiment to see what works best for you.
As far as nearby values being reported as the maximum distance: you should check for float.isNaN or float.isInfinite and filter those values out. There is a minimum range, and objects closer than that distance cannot be detected.
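A conceptual sketch of those two fixes (written in Python/NumPy for illustration rather than Swift/AVFoundation, and with the 50-pixel inset as an assumption to tune per device): crop the unreliable border and mask out NaN/infinite readings.

```python
import numpy as np

def clean_depth(depth, inset=50):
    """depth: 2D float array of values converted from the depth buffer."""
    h, w = depth.shape
    # Drop the unreliable band near the edges.
    cropped = depth[inset:h - inset, inset:w - inset]
    # Readings for objects closer than the minimum range come back NaN/inf.
    valid = np.isfinite(cropped)
    cleaned = np.where(valid, cropped, np.nan)
    return cleaned, valid
```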

OpenCV - align stack of images - different cameras

We have this camera array arranged in an arc around a person (red dot). Think The Matrix - each camera fires at the same time and then we create an animated gif from the output. The problem is that it is near impossible to align the cameras exactly and so I am looking for a way in OpenCV to align the images better and make it smoother.
Looking for general steps. I'm unsure of the order I would do it in. If I start with image 1 and match image 2 to it, then image 2 is further from image 3 than it was at the start, so matching 3 to 2 would require a bigger change... and the error would propagate. I have seen similar alignments done, though. Any help much appreciated.
Here's a thought. How about performing a quick and very simple "calibration" of the imaging system by using a single reference point?
The best thing about this is you can try it out pretty quickly and even if results are too bad for you, they can give you some more insight into the problem. But the bad thing is it may just not be good enough because it's hard to think of anything "less advanced" than this. Here's the description:
Remove the object from the scene
Place a small object (let's call it a "dot") at a position that roughly corresponds to the center of mass of the object you are about to record (the center of the area denoted by the red circle).
Record a single image with each camera
Use some simple algorithm to find the position of the dot on every image
Compute distances from dot positions to image centers on every image
Shift images by (-x, -y), where (x, y) is the above mentioned distance; after that, the dot should be located in the center of every image.
When recording an actual object, use these precomputed distances to shift all images. After you translate the images, they will be roughly aligned. But since you are shooting an object that is three-dimensional and has considerable size, I am not sure whether the alignment will be very convincing ... I wonder what results you'd get, actually.
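A minimal sketch of this per-camera shift, assuming one calibration frame per camera and a dot bright enough to isolate with a simple threshold (the threshold value and detection method are assumptions):

```python
import cv2
import numpy as np

def dot_offset(calib_img):
    """Offset (dx, dy) that moves the detected dot to the image center."""
    gray = cv2.cvtColor(calib_img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)  # assumes a bright dot
    M = cv2.moments(mask)
    dot = np.array([M["m10"] / M["m00"], M["m01"] / M["m00"]])  # dot centroid
    center = np.array([calib_img.shape[1] / 2.0, calib_img.shape[0] / 2.0])
    return center - dot

def shift_image(img, offset):
    """Translate an image by the precomputed per-camera offset."""
    M = np.float32([[1, 0, offset[0]], [0, 1, offset[1]]])
    return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
```

The offsets only need to be computed once from the calibration frames and can then be applied to every frame each camera records.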
If I understand the application correctly, you should be able to obtain the relative pose of each camera in your array using homographies:
https://docs.opencv.org/3.4.0/d9/dab/tutorial_homography.html
From here, the next step would be to correct for alignment issues by estimating the transform between each camera's actual position and their 'ideal' position in the array. These ideal positions could be computed relative to a single camera, or relative to the focus point of the array (which may help simplify calculation). For each image, applying this corrective transform will result in an image that 'looks like' it was taken from the 'ideal' position.
Note that you may need to estimate relative camera pose in 3-4 array 'sections', as it looks like you have a full 180deg array (e.g. estimate homographies for 4-5 cameras at a time). As long as you have some overlap between sections it should work out.
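A rough sketch of the homography step, assuming you can pair each camera's image with a reference view of its 'ideal' position (for example the neighbouring camera in the arc, or a reference camera); the feature-matching choices (ORB, brute-force Hamming matching, RANSAC threshold) are assumptions, not requirements:

```python
import cv2
import numpy as np

def correct_to_reference(actual_img, reference_img):
    """Warp actual_img so it 'looks like' it was taken from the reference position."""
    gray_a = cv2.cvtColor(actual_img, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(reference_img, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(gray_a, None)
    k2, d2 = orb.detectAndCompute(gray_r, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = reference_img.shape[:2]
    return cv2.warpPerspective(actual_img, H, (w, h))
```

Keep in mind a single homography is only exact for a planar scene (or a pure rotation), so for a 3D subject this is an approximation that works best when adjacent cameras are close together.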
Most of my experience with this sort of thing comes from using MATLAB's stereo camera calibrator app and related functions. Their help page gives a good overview of how to get started estimating camera pose. OpenCV has similar functionality.
https://www.mathworks.com/help/vision/ug/stereo-camera-calibrator-app.html
The cited paper by Zhang gives a great description of the mathematics of pose estimation from correspondence, if you're interested.

How can I detect whether the object is 3D?

I am trying to build a solution that can differentiate between a 3D textured surface with a height of around 200 microns and a regular text print.
The following image is a textured surface. The black color here is the base surface.
Regular text print will be the 2D print of the same 3D textured surface.
[EDIT]
My initial thought about solving this problem looks like this:
The general idea is that images of a 3D object shot from different angles will be less related to each other than images of a 2D object shot under similar conditions.
One possible way to verify this could be:
1. Take 2 images with enough light around (the camera's flash). These images should be shot from angles as far apart as possible relative to the object plane, say one with the camera at 45 degrees on the left side and the other at the same angle on the right side.
2. Extract the ROI and perspective-correct both images.
3. Find the GLCM of the composite of these 2 images. If the contrast of the GLCM is low, it is a 3D surface; otherwise it is a 2D print.
Please pardon the language; open to edit suggestions.
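A hedged sketch of step 3 using scikit-image (graycomatrix is called greycomatrix in older versions); the quantization level and the decision threshold are assumptions that would need tuning, and the classification rule simply follows the hypothesis above:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_contrast(gray_composite, levels=64):
    """Mean GLCM contrast of an 8-bit grayscale composite image."""
    # Quantize to fewer gray levels so the co-occurrence matrix stays small.
    q = (gray_composite.astype(np.float32) / 256.0 * levels).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    return graycoprops(glcm, "contrast").mean()

# Following the hypothesis above: low contrast -> 3D texture, high contrast -> 2D print.
# is_3d = glcm_contrast(composite) < SOME_TUNED_THRESHOLD
```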
If you can get another image with a:
different angle, or
sharper angle, or
different lighting condition,
you may get a result. However, using two images taken from different angles with a calibrated camera, you can compute a stereo-vision depth image, which would solve your problem more directly.
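A minimal sketch of the stereo suggestion, assuming the two grayscale images have already been rectified using the calibration data; the block-matching parameters are placeholders:

```python
import cv2

def disparity_map(left_gray, right_gray):
    """Disparity (proportional to inverse depth) from a rectified 8-bit stereo pair."""
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    return stereo.compute(left_gray, right_gray)
```

Whether a 200-micron relief is resolvable this way depends heavily on the baseline, the resolution, and the quality of the rectification.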
This is a pretty complex problem and there is no plug-and-play solution for it. Using light (structured or laser) or shadow to detect a height of 0.2 mm will almost surely not work with an acceptable degree of confidence, no matter how many photos you take. (This is just my personal intuition; in computer vision we verify whether something works by actually testing it.)
GLCM is a nice feature for describing texture, but as far as I know it is used to verify whether there is a pattern in the texture, so I believe it would also output a positive value for 2D printed text if there is some kind of repeating pattern.
I would let the computer learn what is text and what is texture. Just extract a large amount of 3D and 2D data and use a machine-learning engine to learn which is which. If the feature space is rich enough, it may be able to find a way to differentiate one from the other in a way our human mind wouldn't be able to. The feature space should consist of edge and colour features.
If the system environment is stable and controlled, this approach will work especially well, since the training data will be very similar to the testing data.
For this problem, I'd start by computing colour and edge features (local image pixel sums over different edge and colour channels) and try a boosted classifier. Boosted classifiers aren't the state of the art when it comes to machine learning, but they are good at not overfitting (meaning you can just feed in as much data as you want), and they will most likely work in a stable environment.
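An illustrative sketch of that pipeline; scikit-learn's GradientBoostingClassifier stands in for "a boosted classifier", and the specific features (Canny edge density, gray-level spread, a coarse colour histogram) are assumptions rather than a prescription:

```python
import cv2
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def features(img_bgr):
    """Simple edge and colour features for one image patch."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    hist = cv2.calcHist([img_bgr], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256]).flatten()
    return np.concatenate([[edges.mean(), gray.std()], hist / (hist.sum() + 1e-9)])

# Hypothetical training data: labelled patches of 3D texture (1) and 2D print (0).
# X = np.array([features(img) for img in training_patches])
# y = np.array(labels)
# clf = GradientBoostingClassifier().fit(X, y)
# prediction = clf.predict([features(new_patch)])
```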
Hope this helps,
Good luck.

Shape/Pattern Matching Approach in Computer Vision

I am currently facing a, in my opinion, rather common problem which should be quite easy to solve, but so far all my approaches have failed, so I am turning to you for help.
I think the problem is explained best with some illustrations. I have some Patterns like these two:
I also have an image like this (probably better, because the photo this one originated from was quite poorly lit):
(Note how the template was scaled to roughly fit the size of the image.)
The ultimate goal is a tool which determines whether the user shows a thumbs-up/thumbs-down gesture, and also some angles in between. So I want to match the patterns against the image and see which one resembles the picture the most (or, to be more precise, which angle the hand is showing). I know the direction in which the thumb points in each pattern, so if I find the pattern that looks identical I also have the angle.
I am working with OpenCV (with Python bindings) and have already tried cvMatchTemplate and MatchShapes, but so far it's not really working reliably.
I can only guess why MatchTemplate failed, but I think a smaller pattern with a smaller white area fits fully into the white area of the picture, thus producing the best matching score, although it's obvious that they don't really look the same.
Are there some methods hidden in OpenCV I haven't found yet, or is there a known algorithm for this kind of problem that I should reimplement?
Happy New Year.
A few simple techniques could work:
After binarization and segmentation, find the Feret's diameter of the blob (a.k.a. the farthest distance between two points on the blob, or the major axis).
Find the convex hull of the point set, flood fill it, and treat it as a connected region. Subtract the original blob (with the thumb) from this filled hull. The difference will be the area between the thumb and the fist, and the position of that area relative to the center of mass should give you an indication of the rotation.
Use a watershed algorithm on the distances of each point to the blob edge. This can help identify the connected thin region (the thumb).
Fit the largest circle (or largest inscribed polygon) within the blob. Dilate this circle or polygon until some fraction of its edge overlaps the background. Subtract this dilated figure from the original image; only the thumb will remain.
If the size of the hand is consistent (or relatively consistent), then you could also perform N morphological erode operations until the thumb disappears, then N dilate operations to grow the fist back to roughly its original size. Subtract this fist-only blob from the original blob to get the thumb blob. Then use the thumb blob's direction (Feret's diameter) and/or its center of mass relative to the fist blob's center of mass to determine the direction (see the sketch after this list).
Techniques to find critical points (regions of strong direction change) are trickier. At the simplest, you might use corner detectors and then check the distance from one corner to another to identify the place where the inner edge of the thumb meets the fist.
For more complex methods, look into papers about shape decomposition by authors such as Kimia, Siddiqi, and Xiaofing Mi.
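Here is a sketch of the erode/dilate technique referenced above, assuming `blob` is the binarized hand mask and that N has been tuned so the thumb (but not the fist) vanishes; the kernel shape and N are assumptions:

```python
import cv2
import numpy as np

def split_thumb(blob, n=10):
    """Separate the thumb from the fist by eroding it away and growing the fist back."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    fist = cv2.erode(blob, kernel, iterations=n)    # thumb disappears
    fist = cv2.dilate(fist, kernel, iterations=n)   # fist regrows to roughly its original size
    thumb = cv2.subtract(blob, fist)                # what is left is (mostly) the thumb
    return thumb, fist

def thumb_angle(thumb, fist):
    """Angle of the thumb centroid relative to the fist centroid, in degrees."""
    mt, mf = cv2.moments(thumb), cv2.moments(fist)
    ct = np.array([mt["m10"] / mt["m00"], mt["m01"] / mt["m00"]])
    cf = np.array([mf["m10"] / mf["m00"], mf["m01"] / mf["m00"]])
    v = ct - cf
    return np.degrees(np.arctan2(-v[1], v[0]))      # negate y because image rows grow downward
```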
MatchTemplate seems like a good fit for the problem you describe. In what way is it failing for you? If you are actually masking the thumbs-up/thumbs-down/thumbs-in-between signs as nicely as you show in your sample image then you have already done the most difficult part.
MatchTemplate does not include rotation and scaling in the search space, so you should generate more templates from your reference image at all rotations you'd like to detect, and you should scale your templates to match the general size of the found thumbs up/thumbs down signs.
[edit]
The result array for MatchTemplate contains values that specify how well the template fits the image at each location. If you use CV_TM_SQDIFF then the lowest value in the result array is the location of the best fit; if you use CV_TM_CCORR or CV_TM_CCOEFF then it is the highest value. If your scaled and rotated template images all have the same number of white pixels, then you can compare the best-fit values you find for the different template images, and the template image that has the best fit overall is the one you want to select.
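A short sketch of that rotated-template search; the angle step and the choice of CV_TM_CCOEFF_NORMED are assumptions you would adapt, and the templates are assumed to already be scaled to roughly match the hand size, as noted above:

```python
import cv2
import numpy as np

def best_rotation(image_gray, template_gray, angles=range(0, 360, 15)):
    """Return (angle, score) of the rotated template that fits the image best."""
    best_angle, best_score = None, -1.0
    h, w = template_gray.shape
    for angle in angles:
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        rotated = cv2.warpAffine(template_gray, M, (w, h))
        result = cv2.matchTemplate(image_gray, rotated, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, _ = cv2.minMaxLoc(result)
        if max_val > best_score:
            best_angle, best_score = angle, max_val
    return best_angle, best_score
```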
There are tons of rotation/scaling independent detection functions that could conceivably help you, but normalizing your problem to work with MatchTemplate is by far the easiest.
For the more advanced stuff, check out SIFT, Haar feature based classifiers, or one of the others available in OpenCV
I think you can get excellent results if you just compute the two points that have the longest shortest path going through the white area. The direction in which the thumb is pointing is just the direction of the line that joins the two points.
You can do this easily by sampling points in the white area and using Floyd-Warshall.
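A sketch of that idea, assuming a binary hand mask loaded from a hypothetical file; the mask is heavily downsampled first so the all-pairs shortest-path computation stays small, and 8-connectivity between neighbouring white pixels stands in for "paths going through white":

```python
import cv2
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import floyd_warshall

mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
mask = (mask > 127).astype(np.uint8)
small = cv2.resize(mask, None, fx=0.1, fy=0.1, interpolation=cv2.INTER_NEAREST)

# Build a graph whose nodes are white pixels and whose edges connect 8-neighbours.
ys, xs = np.nonzero(small)
coords = list(zip(ys, xs))
index = {c: i for i, c in enumerate(coords)}
n = len(coords)
graph = lil_matrix((n, n))
for i, (y, x) in enumerate(coords):
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:  # half the 8-neighbourhood; graph is symmetric
        nb = (y + dy, x + dx)
        if nb in index:
            d = float(np.hypot(dy, dx))
            graph[i, index[nb]] = d
            graph[index[nb], i] = d

# All-pairs shortest paths inside the white region; the largest finite entry
# identifies the two most distant points (thumb tip vs. far side of the fist).
dist = floyd_warshall(graph.tocsr(), directed=False)
dist[np.isinf(dist)] = -1.0
i, j = np.unravel_index(np.argmax(dist), dist.shape)
(y1, x1), (y2, x2) = coords[i], coords[j]
angle = np.degrees(np.arctan2(-(y2 - y1), x2 - x1))  # direction of the joining line (image rows grow downward)
```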

Sprite pixel parsing to determine Vector

Given an image that can contain any variety of solid color images, what is the best method for parsing the image at a given point and then determining the slope (or Vector if you prefer) of that area?
Being new to XNA development, I feel there must be an established method for doing this sort of thing, but I have been Googling this issue for a while now.
By way of example, I have mocked up a quick image to demonstrate what I am trying to do. The white portion of the image (where the labels are shown) would be transparent pixels. The "ground" would be a RenderTarget2D or Texture2D object that will provide the Color array of pixels.
Example
What you are looking for is the tangent, which is 90 degrees to the normal (which is more commonly used). These two terms should assist you in your searching.
This is trivial if you've got the polygon outline data. If all you have is an image, then you have to come up with a way to convert it into a polygon.
It may not be entirely suitable for your problem, but the first place I would go is the Farseer Physics Engine, which has a "texture to polygon" feature you could possibly reuse.
If you are using the terrain as some kind of "ground", you can possibly cheat a bit by looking at the adjacent column of pixels and using that to determine the ground slope at that exact point. Kind of like what Lemmings and Worms do.
If you make that determination at the boundary between each pixel, you can get rise:run gradients between two horizontally adjacent pixels. Usually you just break it into categories: flat (1:1), 45 degrees (2:1), or too steep (>3:1). With a more complicated algorithm that looks outward to more columns, you can get better resolution.
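A small sketch of that adjacent-column idea (written in Python/NumPy for illustration, although the question is about XNA/C#); `solid` is assumed to be a 2D boolean array built from the texture's colour data, with True marking ground pixels:

```python
import numpy as np

def ground_slope_at(solid, x, window=1):
    """Rise over run of the ground surface between column x and column x + window."""
    # y index of the topmost ground pixel in each column (rows grow downward).
    y_left = int(np.argmax(solid[:, x]))
    y_right = int(np.argmax(solid[:, x + window]))
    rise = y_left - y_right            # positive when the ground climbs to the right
    return rise / float(window)        # widen `window` to look at more columns for smoother values
```

Note that np.argmax returns 0 for an all-empty column, so in practice you would first check that each column actually contains a ground pixel.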
