Evaluating the confidence of an image registration process - opencv

Background:
Assuming there are two shots for the same scene from two different perspective. Applying a registration algorithm on them will result in Homography Matrix that represents the relation between them. By warping one of them using this Homography Matrix will (theoretically) result in two identical images (if the non-shared area is ignored).
Since no perfection is exist, the two images may not be absolutely identical, we may find some differences between them and this differences can be shown obviously while subtracting them.
Example:
Furthermore, the lighting condition may results in huge difference while subtracting.
Problem:
I am looking for a metric that I can evaluate the accuracy of the registration process. This metric should be:
Normalized: 0->1 measurement which does not relate to the image type (natural scene, text, human...). For example, if two totally different registration process on totally different pair of photos have the same confidence, let us say 0.5, this means that the same good (or bad) registeration happened. This should applied even one of the pair is for very details-reach photos and the other of white background with "Hello" in black written.
Distinguishing between miss-registration accuracy and different lighting conditions: Although there is many way to eliminate this difference and make the two images look approximately the same, I am looking of measurement that does not count them rather than fixing them (performance issue).
One of the first thing that came in mind is to sum the absolute differences of the two images. However, this will result in a number that represent the error. This number has no meaning when you want to compare it to another registration process because another images with better registration but more details may give a bigger error rather than a smaller one.
Sorry for the long post. I am glad to provide any further information and collaborating in finding the solution.
P.S. Using OpenCV is acceptable and preferable.

You can always use invariant (lighting/scale/rotation) features in both images. For example SIFT features.
When you match these using typical ratio (between nearest and next nearest), you'll have a large set of matches. You can calculate the homography using your method, or using RANSAC on these matches.
In any case, for any homography candidate, you can calculate the number of feature matches (out of all), which agree with the model.
The number divided by the total matches number gives you a metric of 0-1 as to the quality of the model.
If you use RANSAC using the matches to calculate the homography, the quality metric is already built in.

This problem is given two images decide how misaligned they are.
Thats why we did the registration. The registration approach cannot answer itself how bad a job it did becasue if it knew it it would have done it.
Only in the absolute correct case do we know the result: 0
You want a deterministic answer? you add deterministic input.
a red square in a given fixed position which can be measured how rotated - translated-scaled it is. In the conditions of lab this can be achieved.

Related

Interpolation vs Average

I'm new of computer vision concepts and I'd like to know why, when we double the size of an image, we should use bilinear interpolation where pixels haven't values instead of average between nearest known values pixels.
I'm not sure I agree with the premise that you "should use bilinear interpolation". You shouldn't blindly use anything without thinking about it. For example, if your pixels represent the result of a classification and 1 represents wheat, and 2 represents water, and 3 represents barley, you certainly shouldn't take the average and assume that when you enlarge an image of wheat and barley that some ocean suddenly appears in the middle between the fields.
Bilinear interpolation is actually just averaging, except a) it is in 2 dimensions because images are inherently 2-dimensional and b) if you know you are nearer to one point than another, surely it isn't unreasonable to weight your "guesstimated" value (which, after all, you don't actually know) more towards the geometrically closer value?
I guess my answer is really that there are several types of interpolation, and you should apply some thinking to deciding which one is best for your particular circumstances. Sometimes you don't want to introduce new colours because of classification or palette issues, and in these circumstances you need "nearest neighbour". Sometimes "bilinear" is what you need, sometimes "bicubic".

Image similarity of apartment photos

I want to design an algorithm that would find matches in images of the same apartment, when put up by different real estate agents.
Photos are relatively taken in similar time so the interior of the rooms should not change that much but of course every guys takes different pictures from different angles, etc.
(TLDR; a apartment goes for sale, and different real estate guys come in and make their own pictures, and I want to know if the given pictures from various guys are of the same place)
I know that image processing and recognition algorithm selections highly depend on the use case, so could you point me in correct direction given my use-case?
http://reality.bazos.sk/inzerat/56232813/Prenajom-1-izb-bytu-v-sirsom-centre.php
http://reality.bazos.sk/inzerat/56371292/-PRENAJOM-krasny-1i-byt-rekonstr-Kupeckeho-Ruzinov-BA-II.php
You can actually use Clarifai's Custom Training API endpoint, fairly simple and straightforward. All you would have to do is train the initial image and then compare the second to it. If the probability is high, it is likely the same apartment. For example:
In javascript, to declare a positive it is:
clarifai.positive('http://example.com/apartment1.jpg', 'firstapartment', callback);
And a negative is:
clarifai.negative('http://example.com/notapartment1.jpg', 'firstapartment', callback);
You don't necessarily have to do a negative, but it could only help. Then, when you are comparing images to the first aparment, you do:
clarifai.predict('http://example.com/someotherapartment.jpg', 'firstapartment', callback);
This will give you a probability regarding the likeness of the photo to what you've trained ('firstapartment'). This API is basically doing machine learning without the hassle of the actual machine. Clarifai's API also has a tagging input that is extremely accurate with some basic tags. The API is free for a certain number of calls/month. Definitely worth it to check out for this case.
As user Shaked mentioned in a comment, this is a difficult problem. Even if you knew the position and orientation of each camera in space, and also the characteristics of each camera, it wouldn't be a trivial problem to match the images.
A "bag of words" (BoW) approach may be of use here. Rather than try to identify specific objects and/or deduce the original 3D scene, you determine what "feature descriptors" can distinguish objects from one another in your image sets.
https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision
Imagine you could describe the two images by the relative locations of textures and colors:
horizontal-ish line segments at far left
red blob near center left
green clumpy thing at bottom left
bright round object near top left
...
then for a reasonably constrained set of images (e.g. photos just within a certain zip code), you may be able to yield a good match between the two images above.
The Wikipedia article on BoW may look a bit daunting, but I think if you hunt around you'll find an article that describes "bag of words" for image processing clearly. I've seen a very good demo of a BoW approach used to identify objects such as boats and delivery vans in arbitrary video streams, and it worked impressively well. I wish I had a copy of the presentation to pass along.
If you don't suspect the image to change much, you could try the standard first step of any standard structure-from-motion algorithm to establish a notion of similarity between a pair of images. Any pair of images are similar if they contain a number of matching image features larger than a threshold which satisfy the geometrical constraint of the scene as well. For a general scene, that geometrical constraint is given by a Fundamental Matrix F computed using a subset of matching features.
Here are the steps. I have inserted the opencv method for each step, but you could write your methods too:
Read the pair of images. Use img = cv2.imread(filename).
Use SIFT/SURF to detect image features/descriptors in both images.
sift = cv2.xfeatures2d.SIFT_create()
kp, des = sift.detectAndCompute(img,None)
Match features using the descriptors.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1,des2)
Use RANSAC to compute funamental matrix.
cv2.findFundamentalMatrix(pts1, pts2, cv2.FM_RANSAC, 3, 0.99, mask)
mask contains all the inliers. Simply count them to determine if the number of matches satisfying geometrical constraint is large enough.
CAUTION: In case of a planar scene, we use homography instead of a fundamental matrix and the steps described above work out pretty nicely because homography takes a point to a corresponding point in the other image. However, Fundamental matrix takes a point to the corresponding epipolar line in the other image, which makes the entire process a bit less stable. So I would recommend trying these steps a few more times with a little bit of jitter to the feature locations and collating the evidence over more than one trial to make the decision. You can also use more advanced steps to introduce robustness to this process but only if the steps described above don't yield the results you need.

License Plate Image Matching

I would like to match two license plate images, sample images given below
Here these two license plate belong to same vehicle, hence they should give match.
There may be zoom and slight rotation in these images, also only a part of the original may be visible as given in the example.
If the License plate belong to different vehicle algorithm should say it is different.
Which is best algorithm for doing this ?
I would suggest you use openCV functions from Features2D Framework, and Homography method to handle the scaling and rotation problem. Specifically, in Features2D, there are classes that may be helpful for your to detect the letter, extract them, and match your two templates after extraction.
Frankly this is a non-trivial question.
Just to list some obvious options:
Implement one of the numerous character recognition softwares, and
get the string of characters, and then do a search for the substring
in another string.
For images with almost no difference in zoom
level, Use edge detection filters, like canny edge detection, to
enhance the image, then use ICP (Iterative Closest Point), letting
each edge pixel provide a vector to the closest edge pixel in the
other image, with a similar value. this typically aligns images if
they are similar enough. The final score tells you how similar they
are.
For very large zoom levels, use multiple rotation and zoom
hypothesis, and for each, scale the images and do cross correlation
of the two images. select the hypothesis, that provides the
coordinates with the best correlation, and use the point of
correlation, as the x and y offset. The value of the correlation
tells you how good a fit you have..
many other smarter algorithms have been produced for image fitting. However, you have much larger problems.
The two example images you provide does not show the entire licenseplate, so you will not be able to say anything better than, "the probabillity of a match is larger than zero", as the number of visible characters increase, so does the probabillity of a match.
you could argue that small damages to a license plate also increases the probabillity, in that case cross correlation or similar method is needed to evaluate the probabillity of a match.

Scoreboard digit recognition using OpenCV

I am trying to extract numbers from a typical scoreboard that you would find at a high school gym. I have each number in a digital "alarm clock" font and have managed to perspective correct, threshold and extract a given digit from the video feed
Here's a sample of my template input
My problem is that no one classification method will accurately determine all digits 0-9. I have tried several methods
1) Tesseract OCR - this one consistently messes up on 4 and frequently returns weird results. Just using the command line version. If I actually try to train it on an "alarm clock" font, I get unknown character every time.
2) kNearest with OpenCV - I search a database consisting of my template images (0-9) and see which one is nearest. I frequently get confusion between 3/1 and 7/1
3) cvMatchShapes - this one is fairly bad, it usually can't tell the difference between 2 of the digits for each input digit
4) Tangent Distance - This one is the closest, but the smallest tangent distance between the input and my templates ends up mapping "7" to "1" every time
I'm really at a loss to get a classification algorithm for such a simple problem. I feel I have cleaned up the input fairly well and it's a fairly simple case for classification but I can't get anything reliable enough to actually use in practice. Any ideas about where to look for classification algorithms, or how to use them correctly would be appreciated. Am I not cleaning up the input? What about a better input database? I don't know what else I'd use for input, each digit and template looks spot on at this point.
The classical digit recognition, which should work well in this case is to crop the image just around the digit and resize it to 4x4 pixels.
A Discrete Cosine Transform (DCT) can be used to further slim down the search space. You could select the first 4-6 values.
With those values, train a classifier. SVM is a good one, readily available in OpenCV.
It is not as simple as emma's or martin suggestions, but it's more elegant and, I think, more robust.
Given the width/height ratio of your input, you may choose a different resolution, like 3x4. Choose the smallest one that retains readable digits.
Given the highly regular nature of your input, you could define a set of 7 target areas of the image to check. Each area should encompass some significant portion of one of the 7 segments of each digital of the display, but not overlap.
You can then check each area and average the color / brightness of the pixels in to to generate a probability for a given binary state. If your probability is high on all areas you can then easily figure out what the digit is.
It's not as elegant as a pure ML type algorithm, but ML is far more suited to inputs which are not regular, and in this case that does not seem to apply - so you trade elegance for accuracy.
Might sound silly but have you tried simply checking for black bars vertically and then horizontally in the top and bottom halfs - left and right of the centerline ?
If you are trying text recognition with Tesseract, try passing not one digit, but a number of duplicated digits, sometimes it could produce better results, here's the example.
However, if you're planning a business software, you may want to have a look at a commercial OCR SDK. For example, try ABBYY FineReader Engine. It's not affordable for free to use applications, but when it comes to business, it can a good value to your product. As far as i know, ABBYY provides the best OCR quality, for example check out http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
You want your scorecard image inputs S feeding an algorithm that maps them to {0,1,2,3,4,5,6,7,8,9}.
Let V denote the set of n-tuples of integers.
Construct an algorithm α that maps each image S to a n-tuple
(k1,k2,...,kn)
that can differentiate between two different scoreboard digits.
If you can specify the range of α then you only have to collect the vectors in V that correspond to a digit in order to solve the problem.
I've applied this idea using Martin Beckett's idea and it works. My initial attempt was a simple injection into a 2-tuple by vertical left-to-right summing, with the first integer a image column offset and the second integer was the length of a 'nice' vertical line.
This did not work - images for 6 and 8 would map to the same vectors. So I needed another mini-info-capture for my digit input types (they are not scoreboard) and a 3-tuple info vector does the trick.

OpenCV RANSAC returns the same transformation everytime

I am confused as to how to use the OpenCV findHomography method to compute the optimal transformation.
The way I use it is as follows:
cv::Mat h = cv::findHomography(src, dst, CV_RANSAC, 5.f);
No matter how many times I run it, I get the same transformation matrix. I thought RANSAC is supposed to randomly select a subset of points to do the fitting, so why does it return the same transformation matrix every time? Is it related to some random number initialization? How can I make this behaviour actually random?
Secondly, how can I tune the number of RANSAC iterations in this setup? Usually the number of iterations is based on inlier ratios and things like that.
I thought RANSAC is supposed to randomly select a subset of points to do the fitting, so why does it return the same transformation matrix every time?
RANSAC repeatedly selects a subset of points, then fits a model based upon them, then checks how many data points in the data set are inliers given that fitted model. Once it's done that lots of times, it picks the fitted model that had the most inliers, and refits the model to those inliers.
For any given data set, set of variable model parameters, and rule for what constitutes an inlier, there will exist one or more (but often exactly one) largest possible set of "inliers". For example, given this data set (image from Wikipedia):
... then with some sort of reasonable definition of an outlier, the maximal possible set of inliers any linear model can have is the one in blue below:
Let's call the set of blue points above - the maximal possible set of inliers - I.
If you randomly select a small number of points (e.g. two or three) and draw a line of best fit through them, it's hopefully intuitively obvious that it'll only take you a handful of tries until you hit an iteration where:
all the randomly-selected points you pick are from I, and so
the line of best fit through those points is roughly equal to the line of best fit in the graph above, and so
the set of inliers found on that iteration is exactly I
From that iteration onwards, all further iterations are a waste that cannot possibly improve the model further (although RANSAC has no way of knowing this, since it doesn't magically know when it's found the maximal set of inliers).
If you have a large enough number of iterations relative to the size of your data set, and a large enough proportion of the data set are inliers, then you will eventually find the maximal set of inliers with a close to 100% chance every time you run RANSAC. As a consequence, RANSAC will (almost) always output exactly the same model.
And that's a good thing! Often, you want RANSAC to find the absolute maximal set of inliers and don't want to settle for anything less. If you're getting different results each time you run RANSAC in such a scenario, that's a sign that you want to increase your number of iterations.
(Of course, in the case above we're talking about trying to fit a line through points in a 2D plane, which isn't what findHomography does, but the principle is the same; there will typically still be a single maximal set of inliers and eventually RANSAC will find it.)
How can I make this behaviour actually random?
Decrease the number of iterations (maxIters) so that RANSAC sometimes fails to find the maximal set of inliers.
But there's generally no reason to do this besides pure intellectual curiosity; you'll basically be deliberately telling RANSAC to output an inferior model.
findHomography will already give you the optimal transformation. The real question is about the meaning of optimal.
For example, with RANSAC you'll have the model with maximum number of inliers, while with LMEDS you'll have the model with minimum median error.
You can modify default behavior by:
changing the number of iteration of RANSAC by setting maxIters (max number allowed is 2000)
decreasing (increasing) the ransacReprojThreshold used to validate a inliers and outliers (usually between 1 and 10).
Regarding you questions.
No matter how many times I run it, I get the same transformation matrix.
Probably your points are good enough that you find always the optimal model.
I thought RANSAC is supposed to randomly select a subset of points to do the fitting
RANSAC (RANdom SAmple Consensus) first selects a random subset, the checks if the model built with these points is good enough. If not, it selects another random subset.
How can I make this behaviour actually random?
I can't imagine a scenario where this would be useful, but you can randomly select 4 couples of points from src and dst, and use getPerspectiveTransform. Unless your points are perfect, you'll get a different matrix for each subset.

Resources