Stereo-Processing, Census Based - image-processing

Hi, I'm currently trying to implement a stereo matching algorithm in C and I'm having trouble understanding a part of the paper.
My problem is the part after the subpixel calculation on page 17. I don't understand how to get the subpixel disparity map for both directions. I'm also a little confused about whether my cost aggregation is correct. It's recommended to use a 5x5 window and sum the values over this block. Do I sum all values in this 5x5 block, or do I add every second value in every second row, as I did for the census transform? Thanks for the help!

Related

How to convert kernel to matrix notation?

I'm trying to understand the bicubic convolution algorithm and haven't been able to work out how the kernel, given as a piecewise function,
is turned into this matrix:
I understand that a was set to -0.5 to arrive at the matrix. No matter how I look at it, I can't arrive at the non-symmetric matrix shown.
I've looked through the paper by Keys, but he does not expand into matrix notation and I've struggled with how to get there.
Any insight would be much appreciated.
The first step in seeing the relation is to multiply the function W(x) by the sampled input data f[n] for a given shift t. This gives 5 weights multiplying 5 input samples, which are added together to form an output sample p(t).
The matrix used to compute p(t) is not symmetric because, for any shift t that is not 0, the weights applied to the samples are not symmetric either. You can see this by writing out W(t+i), which are the weights applied to the 5 samples around the output position t (i in [-2,2]).
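To make the relation concrete, here is a minimal sketch that evaluates the kernel weights directly and compares them with a matrix form. The matrix below is the commonly quoted one for a = -0.5 and is only assumed to match the figure in the question; the function names are hypothetical. Because W is even, W(t - k) over the sample offsets k gives the same set of weights as W(t + i), and for 0 <= t < 1 the weight of the fifth sample is zero, which is why four columns suffice.

```python
def keys_kernel(x, a=-0.5):
    """Keys' piecewise cubic convolution kernel W(x)."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0

# Assumed matrix for a = -0.5 (used with an overall factor of 0.5).
# Rows multiply 1, t, t^2, t^3; columns multiply f[-1], f[0], f[1], f[2].
M = [
    [0.0, 2.0, 0.0, 0.0],
    [-1.0, 0.0, 1.0, 0.0],
    [2.0, -5.0, 4.0, -1.0],
    [-1.0, 3.0, -3.0, 1.0],
]

def weights_from_kernel(t, a=-0.5):
    """Weights applied to the samples at offsets -1, 0, 1, 2 for shift t."""
    return [keys_kernel(t - k, a) for k in (-1, 0, 1, 2)]

def weights_from_matrix(t):
    """The same weights, read off as 0.5 * [1, t, t^2, t^3] times M."""
    powers = [1.0, t, t * t, t**3]
    return [0.5 * sum(powers[r] * M[r][c] for r in range(4)) for c in range(4)]

for t in (0.0, 0.25, 0.5, 0.9):
    print(t, weights_from_kernel(t), weights_from_matrix(t))
```

For any t other than 0 or 0.5 the four weights are not mirror images of each other, which is the asymmetry that shows up in the matrix.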
I've found and understood where Keys describes the process. You can follow along from top to bottom in the image below, but the most important bit to note is Equation 7.
All of the values within the matrix come from the coefficients of the c-terms. The first row of the matrix corresponds to the coefficients of the constant terms, and the first column corresponds to the c_j-1 terms. This can be seen by comparing the figure below to Equation 7's coefficients:
I was able to use this understanding to implement the cubic convolution method to interpolate a surface for which I was able to tune the value of a in order to see the response. I'm happy to help expand on this if anything is unclear!

How to use GMMs returned by opencv grabcut to calculate possibilities of a pixel belonging to foreground or background

I have been scratching my head trying to understand how to use the GMM models returned by the OpenCV grabcut function (Python API). The GMM models returned are 2* 64 elements tuple, which I assume contain both the mean and variance information, but I don't know how to apply them to a pixel with 3 color channels to predict how likely it is to belong to the foreground or the background. I didn't manage to find any example code that does anything with the GMM models returned by the grabcut function.
Alternatively, I understand that I can use EM.predict to obtain the probabilities if I build the graph and train background/foreground using EM. But I want to be able to use grabcut the way it's written.
Any help will be greatly appreciated!
It turned out this is quite a complicated issue. The correct order of calculation is to first assign one of the GMM components to the pixel and then calculate the probability from the weight, mean and covariance of the assigned Gaussian.
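The two steps can be sketched roughly as follows, assuming the weights, means and 3x3 covariances of each model have already been unpacked into plain Python lists (how to unpack the arrays returned by grabcut is not shown here). All function names and the example model values are hypothetical, and picking the component with the highest density is an assumption about how the assignment is made.

```python
import math

def gaussian_density(pixel, mean, cov):
    """Density of a 3-D Gaussian at `pixel`; `cov` is a 3x3 nested list."""
    det = (cov[0][0] * (cov[1][1] * cov[2][2] - cov[1][2] * cov[2][1])
           - cov[0][1] * (cov[1][0] * cov[2][2] - cov[1][2] * cov[2][0])
           + cov[0][2] * (cov[1][0] * cov[2][1] - cov[1][1] * cov[2][0]))
    # Inverse of a 3x3 matrix via the cyclic cofactor formula.
    inv = [[(cov[(j + 1) % 3][(i + 1) % 3] * cov[(j + 2) % 3][(i + 2) % 3]
             - cov[(j + 1) % 3][(i + 2) % 3] * cov[(j + 2) % 3][(i + 1) % 3]) / det
            for j in range(3)] for i in range(3)]
    d = [p - m for p, m in zip(pixel, mean)]
    mahalanobis = sum(d[i] * inv[i][j] * d[j] for i in range(3) for j in range(3))
    return math.exp(-0.5 * mahalanobis) / math.sqrt((2.0 * math.pi) ** 3 * det)

def pixel_likelihood(pixel, weights, means, covs):
    """Step 1: assign the best component. Step 2: weight * density of it."""
    densities = [gaussian_density(pixel, m, c) for m, c in zip(means, covs)]
    k = max(range(len(densities)), key=densities.__getitem__)
    return k, weights[k] * densities[k]

# Hypothetical two-component models, for illustration only.
spread = [[100.0, 0.0, 0.0], [0.0, 100.0, 0.0], [0.0, 0.0, 100.0]]
fg = ([0.6, 0.4], [(200.0, 200.0, 200.0), (150.0, 160.0, 170.0)], [spread, spread])
bg = ([0.5, 0.5], [(20.0, 20.0, 20.0), (60.0, 50.0, 40.0)], [spread, spread])

pixel = (190.0, 195.0, 205.0)
_, p_fg = pixel_likelihood(pixel, *fg)
_, p_bg = pixel_likelihood(pixel, *bg)
fg_share = p_fg / (p_fg + p_bg)  # relative support for foreground
print(fg_share)
```

Normalising the two likelihoods against each other, as in the last lines, is one simple way to turn them into a foreground score; it is not something grabcut itself reports.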

How to make the labels of superpixels to be locally consistent in a gray-level map?

I have a bunch of gray-scale images decomposed into superpixels. Each superpixel in these images has a label in the range [0, 1]. You can see a sample image below.
Here is the challenge: I want the spatially (locally) neighboring superpixels to have consistent labels (close in value).
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested. I have also heard about Conditional Random Field (CRF). Is it helpful?
Any suggestion would be welcome.
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested.
And why is that? Why not consider the helpful advice of your colleagues, who are actually right? Applying a smoothing function is the most reasonable way to go.
I have also heard about Conditional Random Field (CRF). Is it helpful?
This also suggests that you should rather go with your colleagues' advice, as a CRF has nothing to do with your problem. A CRF is a classifier, a sequence classifier to be exact; it requires labeled examples to learn from and has nothing to do with the setting presented.
What are typical approaches?
Exactly what your colleagues proposed: define a smoothing function and apply it to each function value and its neighbourhood. (I will not use the term "labels" as it is misleading: you have continuous values in [0,1], whereas "label" denotes a categorical variable in machine learning.)
Another approach would be to define some optimization problem, where your current assignment of values is one goal, and the second one is "closeness", for example:
Let us assume that you have points with values {(x_i, y_i)}_{i=1}^N and that n(x) returns indices of neighbouring points of x.
Consequently you are trying to find {a_i}_{i=1}^N such that they minimize
$$\sum_{i=1}^{N} (y_i - a_i)^2 + C \sum_{i=1}^{N} \sum_{j \in n(x_i)} (a_i - a_j)^2$$
Here the first sum measures closeness to the current values, C is a constant that weights the two parts against each other, and the double sum measures closeness to neighbouring values.
You can solve the above optimization problem using many techniques, for example with the scipy.optimize.minimize function.
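As a dependency-free illustration of this objective, the sketch below does not call scipy.optimize.minimize; it swaps in plain Gauss-Seidel sweeps on the condition that the gradient is zero. It assumes the neighbour relation is symmetric (if j is a neighbour of i, then i is a neighbour of j), in which case setting the derivative with respect to a_i to zero gives a_i = (y_i + 2C * sum of neighbouring a_j) / (1 + 2C * number of neighbours). The function names and the toy data are hypothetical.

```python
def objective(a, y, neighbours, C):
    """Value of the objective: data term plus C times the smoothness term."""
    fit = sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    smooth = sum((a[i] - a[j]) ** 2 for i in range(len(a)) for j in neighbours[i])
    return fit + C * smooth

def smooth_values(y, neighbours, C, sweeps=200):
    """Minimise the objective by repeated Gauss-Seidel updates.

    `neighbours[i]` lists the indices adjacent to superpixel i and is
    assumed to be symmetric.
    """
    a = list(y)
    for _ in range(sweeps):
        for i, nbrs in enumerate(neighbours):
            a[i] = (y[i] + 2.0 * C * sum(a[j] for j in nbrs)) / (1.0 + 2.0 * C * len(nbrs))
    return a

# Toy example: four superpixels in a chain with alternating values.
y = [0.1, 0.9, 0.2, 0.8]
neighbours = [[1], [0, 2], [1, 3], [2]]
a = smooth_values(y, neighbours, C=1.0)
print(a, objective(y, y, neighbours, 1.0), objective(a, y, neighbours, 1.0))
```

Larger C pulls neighbouring values closer together, while C = 0 leaves the original values unchanged.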
I am not sure that your request makes any sense.
Having close label values for nearby superpixels is trivial: take some smooth function of (X, Y), such as constant or affine, taking values in the range [0,1], and assign the function value to the superpixel centered at (X, Y).
You could also take the distance function from any point in the plane.
But this is of no use as it is unrelated to the image content.

Is it possible to make two grayscale images statistically equivalent?

I have two grayscale images, say G1 and G2. I also have their statistics (min, max, mean and standard deviation). I would like to change G2 so that its statistics (min, max, mean and SD) match those of G1. I have tried arithmetic scaling and got the min and max values of G1 and G2 to match, but the mean and SD are still different. I have also tried histogram fitting of G2 to G1, but that did not do what I wanted either. I am using software called SPIDER, but this question applies to image processing in general and can be handled with different software packages (OpenCV, MATLAB, etc.). Any ideas and suggestions would be greatly appreciated.
The easiest thing to do is to apply histogram equalization to both images (histeq in MATLAB). If you do not want to change both images, then you can do histogram matching, but that's a bit more complicated.
You can generate a mapping of input to output based on a simple curve. Start with the values that don't have any dependencies, min and max - those will set the ends of the curve. Now map the mean values to create a single point in the middle of the curve. To modify the standard deviation, you change the shape of the curve between the mean and the endpoints - a curve that is flatter in the middle will give less deviation, and a curve that is flatter towards the ends but steeper in the middle will magnify it.
Edit: I haven't given this enough thought yet; changing the shape of the curve will also change the mean. But I think it can be worked into something usable.
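As a small related sketch: a purely linear rescale can match the mean and SD of G1 exactly, although it will generally not also match min and max, which is why a nonlinear curve like the one described above would be needed to hit all four. The function names are hypothetical, images are treated as flat lists of pixel values, and G2 is assumed not to be constant (SD greater than zero).

```python
import math

def mean_sd(values):
    """Mean and (population) standard deviation of a flat list of pixels."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, sd

def match_mean_sd(g2, g1):
    """Linearly rescale g2 so its mean and SD equal those of g1."""
    m1, s1 = mean_sd(g1)
    m2, s2 = mean_sd(g2)  # assumes s2 > 0
    return [(v - m2) * (s1 / s2) + m1 for v in g2]

g1 = [10.0, 20.0, 30.0, 200.0]
g2 = [0.0, 50.0, 60.0, 70.0, 90.0]
matched = match_mean_sd(g2, g1)
print(mean_sd(g1), mean_sd(matched))
print(min(g1), max(g1), min(matched), max(matched))
```

The rescaled values can fall outside the valid gray-level range, so clipping may be needed afterwards, and clipping in turn shifts the mean and SD slightly.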
I marked the histogram equalization answer as right because it gave me the best results; however, I was unable to make the two images exactly statistically equivalent.

Scoreboard digit recognition using OpenCV

I am trying to extract numbers from a typical scoreboard that you would find at a high school gym. Each number is in a digital "alarm clock" font, and I have managed to perspective-correct, threshold and extract a given digit from the video feed.
Here's a sample of my template input
My problem is that no one classification method will accurately determine all digits 0-9. I have tried several methods
1) Tesseract OCR - this one consistently messes up on 4 and frequently returns weird results. Just using the command line version. If I actually try to train it on an "alarm clock" font, I get unknown character every time.
2) kNearest with OpenCV - I search a database consisting of my template images (0-9) and see which one is nearest. I frequently get confusion between 3/1 and 7/1
3) cvMatchShapes - this one is fairly bad, it usually can't tell the difference between 2 of the digits for each input digit
4) Tangent Distance - This one is the closest, but the smallest tangent distance between the input and my templates ends up mapping "7" to "1" every time
I'm really at a loss to get a classification algorithm for such a simple problem. I feel I have cleaned up the input fairly well and it's a fairly simple case for classification but I can't get anything reliable enough to actually use in practice. Any ideas about where to look for classification algorithms, or how to use them correctly would be appreciated. Am I not cleaning up the input? What about a better input database? I don't know what else I'd use for input, each digit and template looks spot on at this point.
The classical digit recognition approach, which should work well in this case, is to crop the image tightly around the digit and resize it to 4x4 pixels.
A Discrete Cosine Transform (DCT) can be used to further slim down the search space. You could select the first 4-6 values.
With those values, train a classifier. SVM is a good one, readily available in OpenCV.
It is not as simple as emma's or martin's suggestions, but it's more elegant and, I think, more robust.
Given the width/height ratio of your input, you may choose a different resolution, like 3x4. Choose the smallest one that retains readable digits.
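The feature extraction step of this approach could look roughly like the sketch below, written without OpenCV so that it stays self-contained. It assumes the digit has already been cropped and resized to a small square block stored as a nested list, and it takes "the first values" to mean the lowest-frequency coefficients ordered by u + v, which is only one possible reading. The function names are hypothetical, and training the classifier on these features is not shown.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a nested list."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for r in range(n):
                for c in range(n):
                    s += (block[r][c]
                          * math.cos((2 * r + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * c + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def low_frequency_features(block, count=6):
    """Return the `count` lowest-frequency DCT coefficients as a feature vector."""
    coeffs = dct2(block)
    n = len(coeffs)
    order = sorted(((u, v) for u in range(n) for v in range(n)),
                   key=lambda p: (p[0] + p[1], p[0]))
    return [coeffs[u][v] for u, v in order[:count]]

# Hypothetical 4x4 thumbnail of a thresholded digit (1 = lit pixel).
thumb = [
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
print(low_frequency_features(thumb))
```

Each digit thumbnail then becomes a vector of 4 to 6 numbers, which is the input a classifier such as an SVM would be trained on.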
Given the highly regular nature of your input, you could define a set of 7 target areas of the image to check. Each area should encompass a significant portion of one of the 7 segments of each digit of the display, and the areas should not overlap.
You can then check each area and average the color / brightness of the pixels in it to generate a probability for a given binary state. If your probability is high on all areas, you can then easily figure out what the digit is.
It's not as elegant as a pure ML type algorithm, but ML is far more suited to inputs which are not regular, and in this case that does not seem to apply - so you trade elegance for accuracy.
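A minimal sketch of this idea, under several assumptions: the digit has been cropped and resized to a 20x10 binary image stored as a nested list, lit pixels are 1, and the box coordinates and function names below are hypothetical choices that would need tuning for a real scoreboard. The segment patterns use the usual a to g naming; some displays draw 6, 7 or 9 with one segment more or less, so the table may need adjusting.

```python
# Segment boxes as (row_start, row_end, col_start, col_end) on a 20x10 image.
SEGMENT_BOXES = {
    "a": (0, 2, 2, 8),     # top
    "b": (2, 9, 8, 10),    # top right
    "c": (11, 18, 8, 10),  # bottom right
    "d": (18, 20, 2, 8),   # bottom
    "e": (11, 18, 0, 2),   # bottom left
    "f": (2, 9, 0, 2),     # top left
    "g": (9, 11, 2, 8),    # middle
}

# Lit segments (sorted alphabetically) for each digit.
DIGIT_SEGMENTS = {
    "abcdef": 0, "bc": 1, "abdeg": 2, "abcdg": 3, "bcfg": 4,
    "acdfg": 5, "acdefg": 6, "abc": 7, "abcdefg": 8, "abcdfg": 9,
}

def segment_levels(image):
    """Average brightness inside each of the 7 target areas."""
    levels = {}
    for name, (r0, r1, c0, c1) in SEGMENT_BOXES.items():
        pixels = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        levels[name] = sum(pixels) / len(pixels)
    return levels

def classify_digit(image, threshold=0.5):
    """Return the digit whose segment pattern matches, or None if none does."""
    levels = segment_levels(image)
    lit = "".join(name for name in sorted(levels) if levels[name] >= threshold)
    return DIGIT_SEGMENTS.get(lit)

def render_digit(segments):
    """Helper for trying things out: draw the named segments on a blank image."""
    image = [[0] * 10 for _ in range(20)]
    for name in segments:
        r0, r1, c0, c1 = SEGMENT_BOXES[name]
        for r in range(r0, r1):
            for c in range(c0, c1):
                image[r][c] = 1
    return image

print(classify_digit(render_digit("abc")))
```

Because each area is averaged before thresholding, a few noisy pixels inside a segment do not change the result, and an unrecognised pattern comes back as None instead of being forced onto the nearest digit.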
Might sound silly, but have you tried simply checking for black bars vertically and then horizontally in the top and bottom halves, left and right of the centerline?
If you are trying text recognition with Tesseract, try passing not one digit, but a number of duplicated digits, sometimes it could produce better results, here's the example.
However, if you're planning business software, you may want to have a look at a commercial OCR SDK. For example, try ABBYY FineReader Engine. It's not affordable for free-to-use applications, but when it comes to business, it can add good value to your product. As far as I know, ABBYY provides the best OCR quality; for example, check out http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
You want your scorecard image inputs S feeding an algorithm that maps them to {0,1,2,3,4,5,6,7,8,9}.
Let V denote the set of n-tuples of integers.
Construct an algorithm α that maps each image S to an n-tuple
(k1,k2,...,kn)
that can differentiate between two different scoreboard digits.
If you can specify the range of α then you only have to collect the vectors in V that correspond to a digit in order to solve the problem.
I've applied this idea using Martin Beckett's idea and it works. My initial attempt was a simple injection into a 2-tuple by vertical left-to-right summing, with the first integer being an image column offset and the second integer the length of a 'nice' vertical line.
This did not work - images for 6 and 8 would map to the same vectors. So I needed another mini-info-capture for my digit input types (they are not scoreboard) and a 3-tuple info vector does the trick.

Resources