Thanks for taking your time to read this.
We have fixed stereo pairs of cameras looking into a closed volume. We know the dimensions of the volume and have the intrinsic and extrinsic calibration values
for the camera pairs. The objective being to be able to identify the 3d positions of multiple duplicate objects accurately.
Which naturally leads to what is described as the correspondence problem in litrature. We need a fast technique to match ball A from image 1 with Ball A from image 2 and so on.
At the moment we use the properties of epipolar geomentry (Fundamental matrix) to match the balls from different views in a crude way and works ok when the objects are sparse,
but gives a lot of false positives if the objects are densely scattered. Since ball A in image 1 can lie anywhere on the epipolar line going across image 2, it leads to mismatches
when multiple objects lie on that line and look similar.
Is there a way to re-model this into a 3d line intersection problem or something? Since the ball A in image 1 can only take a bounded limit of 3d values, Is there a way to represent
it as a line in 3d? and do a intersection test to find the closest matching ball in image 2?
Or is there a way to generate a sparse list of 3d values which correspond to each 2d grid of pixels in image 1 and 2, and do a intersection test
of these values to find the matching objects across two cameras?
Because the objects can be identical, OpenCV feature matching algorithms like FLANN, ORB doesn't work.
Any ideas in the form of formulae or code is welcome.
Thanks!
Sak
You've set yourself quite a difficult task. Because one point can occlude another in a view, it's not generally possible even to count the number of points. If each view has two points, but those points fall on the same epipolar line on the other view, then you can count anywhere between 2 and 4 points.
Assuming you want to minimize the points, this starts to look like Minimum Vertex Cover in a dense bipartite graph, with each edge representing the association of a point from each view, and the weight of each edge taken from the registration error of associating the corresponding points (vertices) from each view. MVC is, of course, NP-hard, and if you treat the problem as a general MVC problem then you'll never do better than O(n^2) because that's how many edges there are to examine.
Your particular MVC problem might have structure that can be exploited to perform a more efficient approximation. In particular, I might suggest calculating the epipolar lines in one view, ordering them by angle from the epipole, and similarly sorting the points in that view from the epipole. You can then iterate over the two sorted lists roughly in parallel, greedily associating each point with a nearby epipolar line. Then you can do the same in the other view, but only looking at points in that view which had not yet been associated during the previous pass. I think that a more regimented and provably optimal approach might be possible with dynamic programming (particularly if you strictly bound the registration error) which wouldn't require the second pass, but I can't sketch it out offhand.
For different types of objects it's easy- to find the match using sum-of-absolute-differences. For similar objects, the idea(s) could lead to publish a good paper. Anyway here's one quick algorithm:
detect the two balls in first image (using object detection methods).
divide the image into two segments cantaining two balls.
repeat steps 1 & 2 for second image also.
the direction of segments in two images should give correspondence of the two balls.
Try this, it should work for two balls.
Related
I am currently working on program which could help at my work. I'm trying to use Machine Learning for the classification purpose. The problem is that I don't have enough samples for training the model and augmentation is something I'm trying to avoid because hardware problems (not enough RAM) either on my company laptop and on the Google Collab. So I decided to try to somehow normalize the position of the elements so the differences would be visible for the machine even with no big amount of different samples. Unfortunately now I'm struggling how to normalize those pictures.
Element 1a:
Element 1b:
Element 2a:
Element 2b:
Elements 1a and 1b are the same type and 2a - 2b are the same type. Is there a way to somehow normalize position for those pictures (something like position 0) which would help the algorithm to see differences between them? I've tried using cv2.minAreaSquare to get the square position, rotating them and cropping don't needed area but unfortunately those elements can have different width so after scaling them down the contours are deformed unevenly. Then I was trying to get symmetry axis and using this to do a proper cropping after rotation but still the results didn't meet my expectations. I was thinking to add more normalization points like this:
Normalization Points:
And using this points normalize position of the rest of my elements but Perspective Transform takes only 4 points and with 4 points its also not very good methodology. Maybe you guys know a way how to move those elements to have them in the same positions.
Seeing the images, I believe that the transformation between two pictures is either an isometry (translation + rotation) or a similarity (translation + rotation + scaling). These can be determined with just two points. (Perspective takes four points but I think that this is overkill.)
But for good accuracy, you must make sure that the points are found reliably and precisely. In the first place, you need to guess which features of the shapes are repeatable from one sample to the next.
For example, you might estimate that the straight edges are always in the same relative position. In such a case, I would recommend finding two points on some edges, drawing a line between them and find intersections between the lines.
In the illustration, you find edge points along the red profiles, and from them you draw the green lines. They intersect in the yellow points.
For increased accuracy, you can use a least-squares approach to find a best fit on more than two points.
I have two images taken from slightly different positions. After performing image segmentation on these, I would like to match the segmentation masks of the two images so that I can identify the same objects in both. Most of the time the sizes or orientations of objects change:
Sometimes objects that appear in one image do not appear in the other, for example:
I've tried naively matching individual objects using their centroid positions and sizes, but it's very error prone. What's the best way to do this using, e.g. OpenCV?
You could try a RANSAC-like approach. The transformation is not much skewed so you could just try a similarity transformation (translation + rotation + scaling). This only takes two corresponding point pairs. From two matched pairs, you compute the transformation, transform the other points and see how well they match.
As your number of points is small, an exhaustive search with all possible pairs (N(N-1)/2 of them) is not excessive. (And it seems that you can sort the points left to right without introducing inversions, which will reduce the number of possibilities.)
We have this camera array arranged in an arc around a person (red dot). Think The Matrix - each camera fires at the same time and then we create an animated gif from the output. The problem is that it is near impossible to align the cameras exactly and so I am looking for a way in OpenCV to align the images better and make it smoother.
Looking for general steps. I'm unsure of the order I would do it. If I start with image 1 and match 2 to it, then 2 is further from three than it was at the start. And so matching 3 to 2 would be more change... and the error would propagate. I have seen similar alignments done though. Any help much appreciated.
Here's a thought. How about performing a quick and very simple "calibration" of the imaging system by using a single reference point?
The best thing about this is you can try it out pretty quickly and even if results are too bad for you, they can give you some more insight into the problem. But the bad thing is it may just not be good enough because it's hard to think of anything "less advanced" than this. Here's the description:
Remove the object from the scene
Place a small object (let's call it a "dot") to position that rougly corresponds to center of mass of object you are about to record (the center of area denoted by red circle).
Record a single image with each camera
Use some simple algorithm to find the position of the dot on every image
Compute distances from dot positions to image centers on every image
Shift images by (-x, -y), where (x, y) is the above mentioned distance; after that, the dot should be located in the center of every image.
When recording an actual object, use these precomputed distances to shift all images. After you translate the images, they will be roughly aligned. But since you are shooting an object that is three-dimensional and has considerable size, I am not sure whether the alignment will be very convincing ... I wonder what results you'd get, actually.
If I understand the application correctly, you should be able to obtain the relative pose of each camera in your array using homographies:
https://docs.opencv.org/3.4.0/d9/dab/tutorial_homography.html
From here, the next step would be to correct for alignment issues by estimating the transform between each camera's actual position and their 'ideal' position in the array. These ideal positions could be computed relative to a single camera, or relative to the focus point of the array (which may help simplify calculation). For each image, applying this corrective transform will result in an image that 'looks like' it was taken from the 'ideal' position.
Note that you may need to estimate relative camera pose in 3-4 array 'sections', as it looks like you have a full 180deg array (e.g. estimate homographies for 4-5 cameras at a time). As long as you have some overlap between sections it should work out.
Most of my experience with this sort of thing comes from using MATLAB's stereo camera calibrator app and related functions. Their help page gives a good overview of how to get started estimating camera pose. OpenCV has similar functionality.
https://www.mathworks.com/help/vision/ug/stereo-camera-calibrator-app.html
The cited paper by Zhang gives a great description of the mathematics of pose estimation from correspondence, if you're interested.
Are there functions within OpenCV that will 'track' a gradually changing curve without following sharply divergent crossing lines? Ex: If one were attempting to track individual outlines of two crossed boomerangs, is there an easy way to follow the curved line 'through' the intersection where the two boomerangs cross?
This would require some kind of inertial component that would continue a 'virtual' line when the curve was interrupted by the other crossed boomerang, and then find the continuation of the original line on the opposite side.
This seems simple, but it sounds so complicated when trying to explain it. :-) It does seem like a scenario that would occur often (attempting to trace an occluded object). Perhaps part of a third party library or specialized project?
I believe I have found an approach to this. OpenCV's approxPolyDP finds polygons to approximate the contour. It is relatively easy to track angles between the polygon's sides (as opposed to finding continuous tangents to curves). When an 'internal' angle is found where the two objects meet, it should be possible to match with a corresponding internal angle on the opposite side.
Ex: When two bananas/boomerangs/whatever overlap, the outline will form a sort of cross, with four points and four 'internal angles' (> 180 degrees). It should be possible to match the coordinates of the four internal angles. If their corresponding lines (last known trajectory before overlap) are close enough to parallel, then that indicates overlapping objects rather than one more complex shape.
approxPolyDP simplifies this to geometry and trig. This should be a much easier solution than what I had previously envisioned with continuous bezier curves and inertia. I should have thought of this earlier.
I want to design an algorithm that would find matches in images of the same apartment, when put up by different real estate agents.
Photos are relatively taken in similar time so the interior of the rooms should not change that much but of course every guys takes different pictures from different angles, etc.
(TLDR; a apartment goes for sale, and different real estate guys come in and make their own pictures, and I want to know if the given pictures from various guys are of the same place)
I know that image processing and recognition algorithm selections highly depend on the use case, so could you point me in correct direction given my use-case?
http://reality.bazos.sk/inzerat/56232813/Prenajom-1-izb-bytu-v-sirsom-centre.php
http://reality.bazos.sk/inzerat/56371292/-PRENAJOM-krasny-1i-byt-rekonstr-Kupeckeho-Ruzinov-BA-II.php
You can actually use Clarifai's Custom Training API endpoint, fairly simple and straightforward. All you would have to do is train the initial image and then compare the second to it. If the probability is high, it is likely the same apartment. For example:
In javascript, to declare a positive it is:
clarifai.positive('http://example.com/apartment1.jpg', 'firstapartment', callback);
And a negative is:
clarifai.negative('http://example.com/notapartment1.jpg', 'firstapartment', callback);
You don't necessarily have to do a negative, but it could only help. Then, when you are comparing images to the first aparment, you do:
clarifai.predict('http://example.com/someotherapartment.jpg', 'firstapartment', callback);
This will give you a probability regarding the likeness of the photo to what you've trained ('firstapartment'). This API is basically doing machine learning without the hassle of the actual machine. Clarifai's API also has a tagging input that is extremely accurate with some basic tags. The API is free for a certain number of calls/month. Definitely worth it to check out for this case.
As user Shaked mentioned in a comment, this is a difficult problem. Even if you knew the position and orientation of each camera in space, and also the characteristics of each camera, it wouldn't be a trivial problem to match the images.
A "bag of words" (BoW) approach may be of use here. Rather than try to identify specific objects and/or deduce the original 3D scene, you determine what "feature descriptors" can distinguish objects from one another in your image sets.
https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision
Imagine you could describe the two images by the relative locations of textures and colors:
horizontal-ish line segments at far left
red blob near center left
green clumpy thing at bottom left
bright round object near top left
...
then for a reasonably constrained set of images (e.g. photos just within a certain zip code), you may be able to yield a good match between the two images above.
The Wikipedia article on BoW may look a bit daunting, but I think if you hunt around you'll find an article that describes "bag of words" for image processing clearly. I've seen a very good demo of a BoW approach used to identify objects such as boats and delivery vans in arbitrary video streams, and it worked impressively well. I wish I had a copy of the presentation to pass along.
If you don't suspect the image to change much, you could try the standard first step of any standard structure-from-motion algorithm to establish a notion of similarity between a pair of images. Any pair of images are similar if they contain a number of matching image features larger than a threshold which satisfy the geometrical constraint of the scene as well. For a general scene, that geometrical constraint is given by a Fundamental Matrix F computed using a subset of matching features.
Here are the steps. I have inserted the opencv method for each step, but you could write your methods too:
Read the pair of images. Use img = cv2.imread(filename).
Use SIFT/SURF to detect image features/descriptors in both images.
sift = cv2.xfeatures2d.SIFT_create()
kp, des = sift.detectAndCompute(img,None)
Match features using the descriptors.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1,des2)
Use RANSAC to compute funamental matrix.
cv2.findFundamentalMatrix(pts1, pts2, cv2.FM_RANSAC, 3, 0.99, mask)
mask contains all the inliers. Simply count them to determine if the number of matches satisfying geometrical constraint is large enough.
CAUTION: In case of a planar scene, we use homography instead of a fundamental matrix and the steps described above work out pretty nicely because homography takes a point to a corresponding point in the other image. However, Fundamental matrix takes a point to the corresponding epipolar line in the other image, which makes the entire process a bit less stable. So I would recommend trying these steps a few more times with a little bit of jitter to the feature locations and collating the evidence over more than one trial to make the decision. You can also use more advanced steps to introduce robustness to this process but only if the steps described above don't yield the results you need.