RANSAC with multiple lines to be detected - image-processing

This is a bit of a theoretical question, but I was wondering how one randomly chooses points when there are multiple lines to be detected in an image. In most examples I have seen so far, there seems to be only one line to be detected, and it seems easy. However, I am not sure how it is extended to detect multiple lines with a lot more points.

I think you are operating under a basic misunderstanding. RANSAC is just an algorithm to robustly partition some data points into two classes: those that are well predicted by a given parametric model, and those that aren't. The property of being "well-predicted" is expressed in terms of a loss function ("error") that depends on both the model parameters and the data points.
Reread the above paragraph, then ask yourself: do I have a parametric model expressing a collection of lines? If yes, go ahead and fit it. However, if your model can only handle single lines, you should first segment your dataset into portions that are themselves likely to belong to one line each, and then apply RANSAC to each portion.
In some (easy) cases one can proceed iteratively: first use RANSAC on a one-line model to find a large segment of data that fits one line, remove its segment from the dataset, and iterate on the remaining points.

RANSAC only works well when you want to detect a single inlier model, as Francesco Callari correctly explained. Of course the simple solution would be to use the a "sequential" RANSAC but that does only really work if your lines are mutually exclusive and or can be well constrained, such that RANSAC does really only fit one line instead of spanning multiple ones in an non-optimal manner. To solve this issues there exists a variety of approaches, for instance energy based geometric fitting approaches or using evolutionary dynamics to iteratively determine good candidates, to name two.
Here is a nice introduction to the problem from David F. Fouhey.

Related

Faster-RCNN, why don't we just use only RPN for detection?

As we know, faster-RCNN has two main parts: one is region proposal network(RPN), and another one is fast-RCNN.
My question is, now that region proposal network(RPN) can output class scores and bounding boxes and is trainable, why do we need Fast-RCNN?
Am I thinking it right that the RPN is enough for detection (red circle), and Fast-RCNN is now becoming redundant (blue circle)?
Short answer: no they are not redundant.
The R-CNN article and its variants popularized the use of what we used to call a cascade.
Back then for detection it was fairly common to use different detectors often very similar in structures to do detection because of their complementary power.
If the detections are partly orthogonal it allows to remove false positive along the way.
Furthermore by definition both parts of R-CNN have different roles the first one is used to discriminate objects from background and the second one to discriminate fine grained categories of objects from themselves (and from the background also).
But you are right if there is only 1 class vs the background one could use only the RPN part to to detection but even in that case it would probably better the result to chain two different classifiers (or not see e.g. this article)
PS: I answered because I wanted to but this question is definitely unsuited for stackoverflow
If you just add a class head to the RPN Network, you would indeed get detections, with scores and class estimates.
However, the second stage is used mainly to obtain more accurate detection boxes.
Faster-RCNN is a two-stage detector, like Fast R-CNN.
There, Selective Search was used to generate rough estimates of the location of objects and the second stage then refines them, or rejects them.
Now why is this necessary for the RPN? So why are they only rough estimates?
One reason is the limited receptive field:
The input image is transformed via a CNN into a feature map with limited spatial resolution. For each position on the feature map, the RPN heads estimate if the features at that position correspond to an object and the heads regress the detection box.
The box regression is done based on the final feature map of the CNN. In particular, it may happen that the correct bounding box on the image is larger than the corresponding receptive field due to the CNN.
Example: Lets say we have an image depicting a person and the features at one position of the feature map indicate a high possibiliy for the person. Now, if the corresponding receptive field contains only the body parts, the regressor has to estimate a box enclosing the entire person, although it "sees" only the body part.
Therefore, RPN creates a rough estimate of the bounding box. The second stage of Faster RCNN uses all features contained in the predicted bounding box and can correct the estimate.
In the example, RPN creates a too large bounding box, which is enclosing the person (since it cannot the see the pose of the person), and the second stage uses all information of this box to reshape it such that it is tight. This however can be done much more accurate, since more content of the object is accessable for the network.
faster-rcnn is a two-stage method comparing to one stage method like yolo, ssd, the reason faster-rcnn is accurate is because of its two stage architecture where the RPN is the first stage for proposal generation and the second classification and localisation stage learn more precise results based on the coarse grained result from RPN.
So yes, you can, but your performance is not good enough
I think the blue circle is completely redundant and just adding a class classification layer (gives class for each bounding box containing object) should work just fine and that's what the single shot detectors do with compromised accuracy.
According to my understanding, RPN is just for binary checking if you have Objects in the bbox or not and final Detector part is for classifying the classes ex) car, human, phones, etc

Principal Component Analysis and Rotation

I have implemented a PCA in order to assign rotation information to connected 2D points extracted from images (edge fragments, see data points in image below for examples). I want the information to be robustly reproducible under rotation of the data so that I can use it for recognition purposes (comparable to 1). For this purpose, I want the principal components (eigenvectors) to rotate with the points (+- 180 deg).
My implementation includes a mean centring of the data. I have also tested the implementations of OpenCV and one in Python which yield to the same results. This is why I assume that my implementation is correct and that the problem is the method itself. I had quite good results for other 2D distributions. Nonetheless, for these specific data points, it does not seem to work.
I have done all the tests with and without normalization to the standard deviation (ie., dividing the data of the x and y values by their standard deviations).
Here are my results for different rotations of the data (extracted from images):
PCA Results
As can be seen, the method does not allow to find a reproducible rotation. The data is affected by quantization (because it is extracted from images) which is why I had the idea that this is the origin of the problem. Therefore I repeated the experiment with added random noise (4th column). As can be seen, this does not seem to be the problem.
I have no precise idea how to explain the displayed effects. I note that the general orientation of the principal axes seems to be similar in the first and second row, respectively. I think that this means something, but what exactly? Can I somehow solve the problem or are there possibly better methods for such a problem? Due to some preprocessing it can be assumed that there are no outliers.
Thanks for your help!
For symmetrycal shapes like you shown you can try symmetry detector like this: https://github.com/subokita/Sandbox/tree/master/FSD
On examples it give results like this:

Evaluating the confidence of an image registration process

Background:
Assuming there are two shots for the same scene from two different perspective. Applying a registration algorithm on them will result in Homography Matrix that represents the relation between them. By warping one of them using this Homography Matrix will (theoretically) result in two identical images (if the non-shared area is ignored).
Since no perfection is exist, the two images may not be absolutely identical, we may find some differences between them and this differences can be shown obviously while subtracting them.
Example:
Furthermore, the lighting condition may results in huge difference while subtracting.
Problem:
I am looking for a metric that I can evaluate the accuracy of the registration process. This metric should be:
Normalized: 0->1 measurement which does not relate to the image type (natural scene, text, human...). For example, if two totally different registration process on totally different pair of photos have the same confidence, let us say 0.5, this means that the same good (or bad) registeration happened. This should applied even one of the pair is for very details-reach photos and the other of white background with "Hello" in black written.
Distinguishing between miss-registration accuracy and different lighting conditions: Although there is many way to eliminate this difference and make the two images look approximately the same, I am looking of measurement that does not count them rather than fixing them (performance issue).
One of the first thing that came in mind is to sum the absolute differences of the two images. However, this will result in a number that represent the error. This number has no meaning when you want to compare it to another registration process because another images with better registration but more details may give a bigger error rather than a smaller one.
Sorry for the long post. I am glad to provide any further information and collaborating in finding the solution.
P.S. Using OpenCV is acceptable and preferable.
You can always use invariant (lighting/scale/rotation) features in both images. For example SIFT features.
When you match these using typical ratio (between nearest and next nearest), you'll have a large set of matches. You can calculate the homography using your method, or using RANSAC on these matches.
In any case, for any homography candidate, you can calculate the number of feature matches (out of all), which agree with the model.
The number divided by the total matches number gives you a metric of 0-1 as to the quality of the model.
If you use RANSAC using the matches to calculate the homography, the quality metric is already built in.
This problem is given two images decide how misaligned they are.
Thats why we did the registration. The registration approach cannot answer itself how bad a job it did becasue if it knew it it would have done it.
Only in the absolute correct case do we know the result: 0
You want a deterministic answer? you add deterministic input.
a red square in a given fixed position which can be measured how rotated - translated-scaled it is. In the conditions of lab this can be achieved.

Image similarity of apartment photos

I want to design an algorithm that would find matches in images of the same apartment, when put up by different real estate agents.
Photos are relatively taken in similar time so the interior of the rooms should not change that much but of course every guys takes different pictures from different angles, etc.
(TLDR; a apartment goes for sale, and different real estate guys come in and make their own pictures, and I want to know if the given pictures from various guys are of the same place)
I know that image processing and recognition algorithm selections highly depend on the use case, so could you point me in correct direction given my use-case?
http://reality.bazos.sk/inzerat/56232813/Prenajom-1-izb-bytu-v-sirsom-centre.php
http://reality.bazos.sk/inzerat/56371292/-PRENAJOM-krasny-1i-byt-rekonstr-Kupeckeho-Ruzinov-BA-II.php
You can actually use Clarifai's Custom Training API endpoint, fairly simple and straightforward. All you would have to do is train the initial image and then compare the second to it. If the probability is high, it is likely the same apartment. For example:
In javascript, to declare a positive it is:
clarifai.positive('http://example.com/apartment1.jpg', 'firstapartment', callback);
And a negative is:
clarifai.negative('http://example.com/notapartment1.jpg', 'firstapartment', callback);
You don't necessarily have to do a negative, but it could only help. Then, when you are comparing images to the first aparment, you do:
clarifai.predict('http://example.com/someotherapartment.jpg', 'firstapartment', callback);
This will give you a probability regarding the likeness of the photo to what you've trained ('firstapartment'). This API is basically doing machine learning without the hassle of the actual machine. Clarifai's API also has a tagging input that is extremely accurate with some basic tags. The API is free for a certain number of calls/month. Definitely worth it to check out for this case.
As user Shaked mentioned in a comment, this is a difficult problem. Even if you knew the position and orientation of each camera in space, and also the characteristics of each camera, it wouldn't be a trivial problem to match the images.
A "bag of words" (BoW) approach may be of use here. Rather than try to identify specific objects and/or deduce the original 3D scene, you determine what "feature descriptors" can distinguish objects from one another in your image sets.
https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision
Imagine you could describe the two images by the relative locations of textures and colors:
horizontal-ish line segments at far left
red blob near center left
green clumpy thing at bottom left
bright round object near top left
...
then for a reasonably constrained set of images (e.g. photos just within a certain zip code), you may be able to yield a good match between the two images above.
The Wikipedia article on BoW may look a bit daunting, but I think if you hunt around you'll find an article that describes "bag of words" for image processing clearly. I've seen a very good demo of a BoW approach used to identify objects such as boats and delivery vans in arbitrary video streams, and it worked impressively well. I wish I had a copy of the presentation to pass along.
If you don't suspect the image to change much, you could try the standard first step of any standard structure-from-motion algorithm to establish a notion of similarity between a pair of images. Any pair of images are similar if they contain a number of matching image features larger than a threshold which satisfy the geometrical constraint of the scene as well. For a general scene, that geometrical constraint is given by a Fundamental Matrix F computed using a subset of matching features.
Here are the steps. I have inserted the opencv method for each step, but you could write your methods too:
Read the pair of images. Use img = cv2.imread(filename).
Use SIFT/SURF to detect image features/descriptors in both images.
sift = cv2.xfeatures2d.SIFT_create()
kp, des = sift.detectAndCompute(img,None)
Match features using the descriptors.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1,des2)
Use RANSAC to compute funamental matrix.
cv2.findFundamentalMatrix(pts1, pts2, cv2.FM_RANSAC, 3, 0.99, mask)
mask contains all the inliers. Simply count them to determine if the number of matches satisfying geometrical constraint is large enough.
CAUTION: In case of a planar scene, we use homography instead of a fundamental matrix and the steps described above work out pretty nicely because homography takes a point to a corresponding point in the other image. However, Fundamental matrix takes a point to the corresponding epipolar line in the other image, which makes the entire process a bit less stable. So I would recommend trying these steps a few more times with a little bit of jitter to the feature locations and collating the evidence over more than one trial to make the decision. You can also use more advanced steps to introduce robustness to this process but only if the steps described above don't yield the results you need.

Homography and projective transformation

im trying to write a code that will do projective transformation, but with more than 4 key points. i found this helpful guide but it uses 4 points of reference
https://math.stackexchange.com/questions/296794/finding-the-transform-matrix-from-4-projected-points-with-javascript
i know that matlab uses has a function tcp2form that handles that, but i haven't found a way so far.
anyone can give me some guidance, on how to do so? i can solve the equations using (least squares), but i'm stuck since i have a matrix that is larger than 3*3 and i can't multiple the homogeneous coordinates.
Thanks
If you have more than four control points, you have an overdetermined system of equations. There are two possible scenarios. Either your points are all compatible with the same transformation. In that case, any four points can be used, and the rest will match the transformation exactly. At least in theory. For the sake of numeric stability you'd probably want to choose your points so that they are far from being collinear.
Or your points are not all compatible with a single projective transformation. In this case, all you can hope for is an approximation. If you want the best approximation, you'll have to be more specific about what “best” means, i.e. some kind of error measure. Measuring things in a projective setup is inherently tricky, since there are usually a lot of arbitrary decisions involved.
What you can try is fixing one matrix entry (e.g. the lower right one to 1), then writing the conditions for the remaining 8 coordinates as a system of linear equations, and performing a least squares approximation. But the choice of matrix representative (i.e. fixing one entry here) affects the least squares error measure while it has no effect on the geometric meaning, so this is a pretty arbitrary choice. If the lower right entry of the desired matrix should happen to be zero, you'd computation will run into numeric problems due to overflow.

Resources