Machine learning vector association mutually exclusive - machine-learning

I have two sets of vectors representing points in three dimensional space. The first set is strictly smaller than second one. So I pick up all the vectors in the first set and I want to associate each one of them to just one vector from the second set. The association is binjective. So each point from the first set is associated to just one point in the second set, and the points in second set receive just one association each.
My question here is for which kind of algorithm should I use in order to succeed in such a task. I thought of several different approaches but couldn't find the right one.
At first I thought of using cluster assignment . But then I realised that using such an approach doesn't prevent the associations from being multiple ones. clusters might receive eventually more than one point from the first set which is specifically what I don't want to do.
The idea is that I need such an association so that the very shape depicted from the Vectors from the first set resembles the most the overall picture depicted from the Vectors in the second set. It's like as if the vectors from the first set has been subjected to some sort of 3D transformation into the Vectors in the second set but that for some reasons that transformation is recognisable.

Related

Fast way to find corresponding objects across stereo views

Thanks for taking your time to read this.
We have fixed stereo pairs of cameras looking into a closed volume. We know the dimensions of the volume and have the intrinsic and extrinsic calibration values
for the camera pairs. The objective being to be able to identify the 3d positions of multiple duplicate objects accurately.
Which naturally leads to what is described as the correspondence problem in litrature. We need a fast technique to match ball A from image 1 with Ball A from image 2 and so on.
At the moment we use the properties of epipolar geomentry (Fundamental matrix) to match the balls from different views in a crude way and works ok when the objects are sparse,
but gives a lot of false positives if the objects are densely scattered. Since ball A in image 1 can lie anywhere on the epipolar line going across image 2, it leads to mismatches
when multiple objects lie on that line and look similar.
Is there a way to re-model this into a 3d line intersection problem or something? Since the ball A in image 1 can only take a bounded limit of 3d values, Is there a way to represent
it as a line in 3d? and do a intersection test to find the closest matching ball in image 2?
Or is there a way to generate a sparse list of 3d values which correspond to each 2d grid of pixels in image 1 and 2, and do a intersection test
of these values to find the matching objects across two cameras?
Because the objects can be identical, OpenCV feature matching algorithms like FLANN, ORB doesn't work.
Any ideas in the form of formulae or code is welcome.
Thanks!
Sak
You've set yourself quite a difficult task. Because one point can occlude another in a view, it's not generally possible even to count the number of points. If each view has two points, but those points fall on the same epipolar line on the other view, then you can count anywhere between 2 and 4 points.
Assuming you want to minimize the points, this starts to look like Minimum Vertex Cover in a dense bipartite graph, with each edge representing the association of a point from each view, and the weight of each edge taken from the registration error of associating the corresponding points (vertices) from each view. MVC is, of course, NP-hard, and if you treat the problem as a general MVC problem then you'll never do better than O(n^2) because that's how many edges there are to examine.
Your particular MVC problem might have structure that can be exploited to perform a more efficient approximation. In particular, I might suggest calculating the epipolar lines in one view, ordering them by angle from the epipole, and similarly sorting the points in that view from the epipole. You can then iterate over the two sorted lists roughly in parallel, greedily associating each point with a nearby epipolar line. Then you can do the same in the other view, but only looking at points in that view which had not yet been associated during the previous pass. I think that a more regimented and provably optimal approach might be possible with dynamic programming (particularly if you strictly bound the registration error) which wouldn't require the second pass, but I can't sketch it out offhand.
For different types of objects it's easy- to find the match using sum-of-absolute-differences. For similar objects, the idea(s) could lead to publish a good paper. Anyway here's one quick algorithm:
detect the two balls in first image (using object detection methods).
divide the image into two segments cantaining two balls.
repeat steps 1 & 2 for second image also.
the direction of segments in two images should give correspondence of the two balls.
Try this, it should work for two balls.

Faster-RCNN, why don't we just use only RPN for detection?

As we know, faster-RCNN has two main parts: one is region proposal network(RPN), and another one is fast-RCNN.
My question is, now that region proposal network(RPN) can output class scores and bounding boxes and is trainable, why do we need Fast-RCNN?
Am I thinking it right that the RPN is enough for detection (red circle), and Fast-RCNN is now becoming redundant (blue circle)?
Short answer: no they are not redundant.
The R-CNN article and its variants popularized the use of what we used to call a cascade.
Back then for detection it was fairly common to use different detectors often very similar in structures to do detection because of their complementary power.
If the detections are partly orthogonal it allows to remove false positive along the way.
Furthermore by definition both parts of R-CNN have different roles the first one is used to discriminate objects from background and the second one to discriminate fine grained categories of objects from themselves (and from the background also).
But you are right if there is only 1 class vs the background one could use only the RPN part to to detection but even in that case it would probably better the result to chain two different classifiers (or not see e.g. this article)
PS: I answered because I wanted to but this question is definitely unsuited for stackoverflow
If you just add a class head to the RPN Network, you would indeed get detections, with scores and class estimates.
However, the second stage is used mainly to obtain more accurate detection boxes.
Faster-RCNN is a two-stage detector, like Fast R-CNN.
There, Selective Search was used to generate rough estimates of the location of objects and the second stage then refines them, or rejects them.
Now why is this necessary for the RPN? So why are they only rough estimates?
One reason is the limited receptive field:
The input image is transformed via a CNN into a feature map with limited spatial resolution. For each position on the feature map, the RPN heads estimate if the features at that position correspond to an object and the heads regress the detection box.
The box regression is done based on the final feature map of the CNN. In particular, it may happen that the correct bounding box on the image is larger than the corresponding receptive field due to the CNN.
Example: Lets say we have an image depicting a person and the features at one position of the feature map indicate a high possibiliy for the person. Now, if the corresponding receptive field contains only the body parts, the regressor has to estimate a box enclosing the entire person, although it "sees" only the body part.
Therefore, RPN creates a rough estimate of the bounding box. The second stage of Faster RCNN uses all features contained in the predicted bounding box and can correct the estimate.
In the example, RPN creates a too large bounding box, which is enclosing the person (since it cannot the see the pose of the person), and the second stage uses all information of this box to reshape it such that it is tight. This however can be done much more accurate, since more content of the object is accessable for the network.
faster-rcnn is a two-stage method comparing to one stage method like yolo, ssd, the reason faster-rcnn is accurate is because of its two stage architecture where the RPN is the first stage for proposal generation and the second classification and localisation stage learn more precise results based on the coarse grained result from RPN.
So yes, you can, but your performance is not good enough
I think the blue circle is completely redundant and just adding a class classification layer (gives class for each bounding box containing object) should work just fine and that's what the single shot detectors do with compromised accuracy.
According to my understanding, RPN is just for binary checking if you have Objects in the bbox or not and final Detector part is for classifying the classes ex) car, human, phones, etc

RANSAC with multiple lines to be detected

This is a bit of a theoretical question, but I was wondering how one randomly chooses points when there are multiple lines to be detected in an image. In most examples I have seen so far, there seems to be only one line to be detected, and it seems easy. However, I am not sure how it is extended to detect multiple lines with a lot more points.
I think you are operating under a basic misunderstanding. RANSAC is just an algorithm to robustly partition some data points into two classes: those that are well predicted by a given parametric model, and those that aren't. The property of being "well-predicted" is expressed in terms of a loss function ("error") that depends on both the model parameters and the data points.
Reread the above paragraph, then ask yourself: do I have a parametric model expressing a collection of lines? If yes, go ahead and fit it. However, if your model can only handle single lines, you should first segment your dataset into portions that are themselves likely to belong to one line each, and then apply RANSAC to each portion.
In some (easy) cases one can proceed iteratively: first use RANSAC on a one-line model to find a large segment of data that fits one line, remove its segment from the dataset, and iterate on the remaining points.
RANSAC only works well when you want to detect a single inlier model, as Francesco Callari correctly explained. Of course the simple solution would be to use the a "sequential" RANSAC but that does only really work if your lines are mutually exclusive and or can be well constrained, such that RANSAC does really only fit one line instead of spanning multiple ones in an non-optimal manner. To solve this issues there exists a variety of approaches, for instance energy based geometric fitting approaches or using evolutionary dynamics to iteratively determine good candidates, to name two.
Here is a nice introduction to the problem from David F. Fouhey.

Evaluating the confidence of an image registration process

Background:
Assuming there are two shots for the same scene from two different perspective. Applying a registration algorithm on them will result in Homography Matrix that represents the relation between them. By warping one of them using this Homography Matrix will (theoretically) result in two identical images (if the non-shared area is ignored).
Since no perfection is exist, the two images may not be absolutely identical, we may find some differences between them and this differences can be shown obviously while subtracting them.
Example:
Furthermore, the lighting condition may results in huge difference while subtracting.
Problem:
I am looking for a metric that I can evaluate the accuracy of the registration process. This metric should be:
Normalized: 0->1 measurement which does not relate to the image type (natural scene, text, human...). For example, if two totally different registration process on totally different pair of photos have the same confidence, let us say 0.5, this means that the same good (or bad) registeration happened. This should applied even one of the pair is for very details-reach photos and the other of white background with "Hello" in black written.
Distinguishing between miss-registration accuracy and different lighting conditions: Although there is many way to eliminate this difference and make the two images look approximately the same, I am looking of measurement that does not count them rather than fixing them (performance issue).
One of the first thing that came in mind is to sum the absolute differences of the two images. However, this will result in a number that represent the error. This number has no meaning when you want to compare it to another registration process because another images with better registration but more details may give a bigger error rather than a smaller one.
Sorry for the long post. I am glad to provide any further information and collaborating in finding the solution.
P.S. Using OpenCV is acceptable and preferable.
You can always use invariant (lighting/scale/rotation) features in both images. For example SIFT features.
When you match these using typical ratio (between nearest and next nearest), you'll have a large set of matches. You can calculate the homography using your method, or using RANSAC on these matches.
In any case, for any homography candidate, you can calculate the number of feature matches (out of all), which agree with the model.
The number divided by the total matches number gives you a metric of 0-1 as to the quality of the model.
If you use RANSAC using the matches to calculate the homography, the quality metric is already built in.
This problem is given two images decide how misaligned they are.
Thats why we did the registration. The registration approach cannot answer itself how bad a job it did becasue if it knew it it would have done it.
Only in the absolute correct case do we know the result: 0
You want a deterministic answer? you add deterministic input.
a red square in a given fixed position which can be measured how rotated - translated-scaled it is. In the conditions of lab this can be achieved.

Dimension Reduction of Feature in Machine Learning

Is there any way to reduce the dimension of the following features from 2D coordinate (x,y) to one dimension?
Yes. In fact, there are infinitely many ways to reduce the dimension of the features. It's by no means clear, however, how they perform in practice.
A feature reduction usually is done via a principal component analysis (PCA) which involves a singular value decomposition. It finds the directions with highest variance -- that is, those direction in which "something is going on".
In your case, a PCA might find the black line as one of the two principal components:
The projection of your data onto this one-dimensional subspace than yields the reduced form of your data.
Already with the eye one can see that on this line the three feature sets can be separated -- I coloured the three ranges accordingly. For your example, it is even possible to completely separate the data sets. A new data point then would be classified according to the range in which its projection onto the black line lies (or, more generally, the projection onto the principal component subspace) lies.
Formally, one could obtain a division with further methods that use the PCA-reduced data as input, such as for example clustering methods or a K-nearest neighbour model.
So, yes, in case of your example it could be possible to make such a strong reduction from 2D to 1D, and, at the same time, even obtain a reasonable model.

Resources