Finding data entry points in a blank, scanned application form - image-processing

I am a relative newcomer to image processing and this is the problem I'm facing - Say I have the image of an application form, like this:
Now I would like to detect the locations of all the locations where data is to be entered. In this case, it would be the rectangles divided into a number of boxes like so(not all fields marked):
I can live with the photograph box also being detected. I've tried running the squares.cpp sample in the OpenCV sources, which does not quite get me what I want. I also tried the modified version here - the results were worse(my use case is definitely very different from the OP's in that question).
Also, Hough transforming to get the lines is not really working with/without blur-threshold as the noise in scanned image is contributing to extraneous lines, and also, thresholding is taking away parts of the combs(the small squares), and hence the line detection is not up to the mark.
Note that this form is not a scanned copy of a printed form, but the real input might very well be a noisy, scanned image of a printed form.
While I'm definitely sure that this is possible(at least with some tolerance allowed) and I'm trying to get at the solution, it would be really helpful if I get insights and ideas from other people who might have tried something like this/enjoy hacking on CV problems. Also, it would be really nice if the answers explain why a particular operation was done (e.g., dilation to try and fill up any holes left by thresholding, etc)

Are the forms consistent in any way? Are the "such boxes" the same size on all forms? If you can rely on a consistent size, like the character boxes in the form above, you could use template matching.
Otherwise, the problem seems to be: find any/all rectangles on the image (with a post processing step to filter out any that have a significant amount of markings within, or to merge neighboring rectangles).
The more you can take advantage of the consistencies between the forms, the easier the problem will be. Use any context you can get.
EDIT
Using the gradients (computed by using a Sobel kernel in both the x and the y direction) you can weed out a lot of the noise.
Using both you can find the direction of the gradients (equation can be found here: en.wikipedia.org/wiki/Sobel_operator). Let's say we define a discriminating feature of a box to be a vertical or horizontal gradient. If the pixel's gradient has an orientation that's either straight horizontal or straight vertical, keep it, set all else to white.
To make this more robust to noise, you can use a sliding window (3x3) in which you compute the median orientation. If the median (or mean) orientation of the window is vertical or horizontal, keep the current (middle of the window) pixel, otherwise set it to white.
You can use OpenCV for the gradient computation, and possibly the orientation/phase calculation, but you'll probably need to write the code it do the actual sliding window code. I'm not intimately familiar with OpenCV

Related

Recognition and counting of books from side using OpenCV

Just wish to receive some ideas on I can solve this problem.
For a clearer picture, here are examples of some of the image that we are looking at:
I have tried looking into thresholding it, like otsu, blobbing it, etc. However, I am still unable to segment out the books and count them properly. Hardcover is easy of course, as the cover clearly separates the books, but when it comes to softcover, I have not been able to successfully count the number of books.
Does anybody have any suggestions on what I can do? Any help will be greatly appreciated. Thanks.
I ran a sobel edge detector and used Hough transform to detect lines on the last image and it seemed to be working okay for me. You can then link the edges on the output of the sobel edge detector and then count the number of horizontal lines. Or, you can do the same on the output of the lines detected using Hough.
You can further narrow down the area of interest by converting the image into a binary image. The outputs of all of these operators can be seen in following figure ( I couldn't upload an image so had to host it here) http://www.pictureshoster.com/files/v34h8hvvv1no4x13ng6c.jpg
Refer to http://www.mathworks.com/help/images/analyzing-images.html#f11-12512 for some more useful examples on how to do edge, line and corner detection.
Hope this helps.
I think that #audiohead's recommendation is good but you should be careful when applying the Hough transform for images that will have the library's stamp as it might confuse it with another book (You can see that the letters form some break-lines that will be detected by sobel).
Consider to apply first an edge preserving smoothing algorithm such as a Bilateral Filter. When tuned correctly (setting of the Kernels) it can avoid these such of problems.
A Different Solution That Might Work (But can be slow)
Here is a different approach that is based on pixel marking strategy.
a) Based on some very dark threshold, mark all black pixels as visited.
b) While there are unvisited pixels: Pick the next unvisited pixel and apply a region-growing algorithm http://en.wikipedia.org/wiki/Region_growing while marking its pixels with a unique number. At this stage you will need to analyse the geometric shape that this region is forming. A good criteria to detecting a book is that the region is creating some form of a rectangle where width >> height. This will detect a book and mark all its pixels to the unique number.
Once there are no more unvisited pixels, the number of unique numbers is the number of books you will have + For each pixel on your image you will now to which book does it belongs.
Do you have to keep the books this way? If you can change the books to face back side to the camera then I think you can get more information about the different colors used by different books.The lines by Hough transform or edge detection will be more prominent this way.
There exist more sophisticated methods which are much better in contour detection and segmentation, you can have a look at them here, however it is quite slow, http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html
Once you get the ultrametric contour map, you can perform some computation on them to count the number of books
I would try a completely different approach; with paperbacks, the covers are medium-dark lines whilst the rest of the (assuming white pages) are fairly white and "bloomed", so I'd try to thicken up the dark edges to make them easy to detect, then that would give the edges akin to working with hardbacks which you say you've done.
I'd try something like an erosion to thicken up the edges. This would be a nice, fast operation.

What is the correct method to auto-crop objects from light background?

I'm trying to extract objects from scanned images. There could be a few documents on a white background, and I need to crop and rotate them automatically. This seems like a rather simple task, but I've got stuck at some point and get bad results all the time.
I've tried to:
Binarise the image and get connected components by performing morphological operations.
Perform watershed segmentation by using dilated and eroded binary images as mask components.
Apply Canny detector and fill the contours.
None of this gets me good results. If the object does't have contrast edges (i.e a piece of paper on white background), it splits into a lot of separate components. If I connect these components by applying excessive dilation, background noise also expands and everything becomes a mess.
For example, I have an image:
After applying Canny detector and filling the contours I get something like this:
As you can see, the components are not connected. They are eve too far from each other to be connected by a reasonable amount of dilation. And when I apply watershed to this mask combined with some background points, it yields very bad results.
Some images are noisy:
In this particular case I was able to obtain contour of the whole passport by Canny detector because of it's contrast edges. But threshold method doesn't work here.
If the images are always on a very light background, then you can binarize with a threshold close to the maximum possible value. After that it is a matter of correcting the binary image to get the objects, but this step will vary depending on how your other images look like.
For instance, the following image at left is what we get with a threshold at 99% of the maximum value after a gaussian filtering on the input. After removing components connected to the border and other small components, and also combining with some basic morphological tools, we get the image at right.
This may seem a bit wishy-washy but bear with me:
This looks like quite a challenging case for image processing recipes involving only edge detection, morphological operations and segmentation.
What you are not exploiting here is that you (I believe) know what your document should look like. You are currently looking at completely general solutions which do not take into account this prior knowledge. If you can get some training data then you can go all the way from simple template/patch-based matching (SSD, Normalized Cross-Correlation) to more sophisticated object detection techniques to find the position and rotation of your documents.
My guess is that if your objects are always more or less the same and at the same scale (e.g. passports scanned at a fixed resolution/similar machines) then you can get away with a fairly crude approach. There won't be any one correct method. It's also likely that the technique you end up using will not work until you have done a significant amount of parameter tweaking, so don't give up on anything too quickly.

Image Processing for recognizing 2D features

I've created an iPhone app that can scan an image of a page of graph paper and can then tell me which squares have been blacked out and which squares are blank.
I do this by scanning from left to right and use the graph paper's lines as guides. When I encounter a graph paper line, I start to look for black, until I hit the graph paper line again. Then, instead of continuing along the scan line, I go ahead and completely scan the square for black. Then I continue on to the next box. At the end of the line, I skip down so many pixels before starting the scan on a new line (since I have already figured out how tall each box is).
This sort of works, but there are problems. Sometimes I mistake the graph lines as "black". Sometimes, if the image is skewed, or I don't have uniform lighting across the page, then I don't get good results.
What I'd like to do is to specify a few "alignment" boxes that I then resize and rotate (and skew) the picture to align with those. Then, I was thinking that once I have the image aligned, I would then know where all the boxes are and won't have to scan for the boxes, just scan inside the location of the boxes to see if they are black. This should be faster and more reliable. And if I were to operate on images coming from the camera, I'd have more flexibility in asking the user to align the picture to match the alignment marks, rather than having to align the image myself.
Given that this is my first Image Processing project, I feel like I am reinventing the wheel. I'd like suggestions on how to do this, and whether to utilize libraries like OpenCV.
I am enclosing an image similar to what I would like processed. I am looking for a list of all squares that have a significant amount of black marking, i.e. A8, C4, E7, G4, H1, J9.
Issues to be aware of:
Light coverage of the image may not be ideal, but should be relatively consistent across the image (i.e. no shadows)
All squares may be empty or all dark, and the algorithm needs to be able to determine that
the image may be skewed or rotated about any of the axis. Rotation about the z axis maybe easy to fix. There may be rotation around the x or y axis making ones side of the image be wider than the other. However, if I scan the image in realtime as it comes from the camera, I can ask the user to align the alignment marks with marks on the screen. How best to ensure that alignment to give the user appropriate feedback? Just checking to make sure that the 4 corners are dark could result in a false positive when the camera is pointing to a black surface.
not every square will be equally or consistently blacked, but I think there will be enough black to make it unquestionable to a human eye.
the blue grid may be useful, but there are cases where the black markings may overlap the blue grid. I think a virtual grid is probably better than relying on the printed grid. I would think that using the alignment markers to align the image, would then allow for a precise virtual grid to be laid out. And then the contents of each grid box could be sampled, to see if it was predominantly black, vs scanning from left-to-right, no? Here is another image with more markings on the grid. In this image, in addition to the previous marking in A8, C4, E7, G4, H1, J9, I have marked E2, G8 and G9, and I4 and J4 and you can see how the blue grid is obscured.
This is my first phase of this project. Eventually I'd like to scale this algorithm to be able to process at least a few hundred slots and possibly different colors.
To start with, this problem reminded me a bit of these demo's that might be useful to learn from:
The DNA microarray image processing
The Matlab Sudoku solver
The Iphone Sudoku solver blog post, explaining the image processing
Personally, I think the most simple approach would be to detect the squares in your image.
1) Remove the background and small cruft
f_makebw = #(I) im2bw(I.data, double(median(I.data(:)))/1.3);
bw = ~blockproc(im, [128 128], f_makebw);
bw = bwareaopen(bw, 30);
2) Remove everything but the squares and circles.
se = strel('disk', 5);
bw = imerode(bw, se);
% Detect the squares and cricles via morphology
[B, L] = bwboundaries(bw, 'noholes');
3) Detect the squares using 'extend' from regionprops. The 'Extent' metric measures what proportion of the bounding-box is filled. This makes it a
nice measure to distinguish between circles and squares
stats = regionprops(L, 'Extent');
extent = [stats.Extent];
idx1 = find(extent > 0.8);
bw = ismember(L, idx1);
4) This leaves you with your features, to synchronize or rectify the image with. An easy, and robust way, to do this, is via the Autocorrelation Function.
This gives nice peaks, which are easily detected. These peaks can be matched against the ACF peaks from a template image via the Hungarian algorithm. Once matched, you can correct rotation and scaling as you now have a linear system which you can solve:
x = Ax'
Translation can then be corrected using run-of-the-mill cross correlation against the same pre defined template.
If all goes well, you know have an aligned or synchronized image, which should help considerably in determining the position of the dots.
I've been starting to do something similar using my GPUImage iOS framework, so that might be an alternative to doing all of this in OpenCV or something else. As it's name indicates, GPUImage is entirely GPU-based, so it can have some tremendous performance benefits over CPU-bound processing (up to 180X faster for doing things like processing live video).
As a first stage, I took your images and ran them through a simple luminance thresholding filter with a threshold of 0.5 and arrived at the following for your two images:
I just added an adaptive thresholding filter, which attempts to correct for local illumination variances, and works really well for picking out text. However, in your images it uses too small of an averaging radius to handle your blobs well:
and seems to bring out your grid lines, which it sounds like you wish to ignore.
Maurits provides a more comprehensive description of what you could do, but there might be a way to implement these processing operations as high-performance GPU-based filters instead of relying on slower OpenCV versions of the same calculations. If you could grab rotation and scaling information from this thresholded image, you could construct a transform that could also be applied as a filter to your thresholded image to produce your final aligned image, which could then be downsampled and read out by your application to determine which grid locations were filled in.
These GPU-based thresholding operations run in less than 2 ms for 640x480 frames on an iPhone 4, so it might be possible to chain filters together to analyze incoming video frames as fast as the device's video camera can provide them.

pattern recognition between two very different images

So, my problem is that I have to find common points between two images of a microchip. Here's an example of two images:
Between these two images, we can clearly see some common pattern like the wires on the bottom right of the first images that can be found in relatively the same place in the second image. Also, the sort of white Z shape in the first image can be seen in the second images, a bit harder, but it's there.
I tried to match them with SURF (OpenCV), found no common point at all. Tried to apply some filter on both images, like edge detection, thresholding, and other filter that I could found in GIMP, but whatever I tried, no common point were ever found.
I'd like to know if you have any idea to solve this problem ? My suggestion right now would be to manually match key features in both images with line segments, but preferably, it should be automated.
A solution that uses OpenCV would be preferable, but I'm looking for any suggestion possible. In OpenCV, all pattern matching situation that I saw were problems way more obvious that this one. No difference in color and so on.
Unless realtime is required, do a simple approach to test if rotation can be automated:
Circuit boards like the ones in the images, are often based on perpendicular straight line segments. Hence you can "despeckle" and remove stuff like coffee stains, by finding linesegments.
Think about creating a kernel, that have a line with dark pixels on one side, and bright pixels on the other. Fold it on the image (or cross-correlate it) to identify all pixels that have a sequence of bright/dark pixels which are nearly vertical or horizontal.
you may interlace to speed things up.
edges of stains and speckles may survive this, if you want angles close to 45* representatations!
The resulting image can be interpreted as a sparse pointcloud.
You can now use RANSAC or other similar approaches to describe many of the remaining correlations, as line segments.
* use a 2 point line segment as input model for RANSAC, Degrade if small.
* Determine infinite lines that have many inliers
* use growth or binninng approaches to segmentate lines.
benefits:
high likelyhood of line segment representations that are actually present as circuitry in image. 2 point description of segments, possible transforms are easy.
easy interpretation of data, as it can be overlayed in openCV
Rotation should be easily found as the rotation that matches most found lines to horizontal and/or vertical axis'es.
apply rotation.
repeat for both images.
now you can determine best translation between the images, by simple x,y cross correlation.
If the top image is always of that quality (quasi bilevel patterns, easy edge detection), I would try a good geometric matching algorithm (such as Cognex or Halcon), training with the top image and searching the bottom one.
Maybe it is worth to first compensate rotation (I hope there is no scaling). You would do that by determining the dominant edge direction, possibly using a Hough transform. Or, much better, by careful mechanical alignment of the sensors.
Anyway, chances of success are low, this is a difficult problem.

Shape/Pattern Matching Approach in Computer Vision

I am currently facing a, in my opinion, rather common problem which should be quite easy to solve but so far all my approached have failed so I am turning to you for help.
I think the problem is explained best with some illustrations. I have some Patterns like these two:
I also have an Image like (probably better, because the photo this one originated from was quite poorly lit) this:
(Note how the Template was scaled to kinda fit the size of the image)
The ultimate goal is a tool which determines whether the user shows a thumb up/thumbs down gesture and also some angles in between. So I want to match the patterns against the image and see which one resembles the picture the most (or to be more precise, the angle the hand is showing). I know the direction in which the thumb is showing in the pattern, so if i find the pattern which looks identical I also have the angle.
I am working with OpenCV (with Python Bindings) and already tried cvMatchTemplate and MatchShapes but so far its not really working reliably.
I can only guess why MatchTemplate failed but I think that a smaller pattern with a smaller white are fits fully into the white area of a picture thus creating the best matching factor although its obvious that they dont really look the same.
Are there some Methods hidden in OpenCV I havent found yet or is there a known algorithm for those kinds of problem I should reimplement?
Happy New Year.
A few simple techniques could work:
After binarization and segmentation, find Feret's diameter of the blob (a.k.a. the farthest distance between points, or the major axis).
Find the convex hull of the point set, flood fill it, and treat it as a connected region. Subtract the original image with the thumb. The difference will be the area between the thumb and fist, and the position of that area relative to the center of mass should give you an indication of rotation.
Use a watershed algorithm on the distances of each point to the blob edge. This can help identify the connected thin region (the thumb).
Fit the largest circle (or largest inscribed polygon) within the blob. Dilate this circle or polygon until some fraction of its edge overlaps the background. Subtract this dilated figure from the original image; only the thumb will remain.
If the size of the hand is consistent (or relatively consistent), then you could also perform N morphological erode operations until the thumb disappears, then N dilate operations to grow the fist back to its original approximate size. Subtract this fist-only blob from the original blob to get the thumb blob. Then uses the thumb blob direction (Feret's diameter) and/or center of mass relative to the fist blob center of mass to determine direction.
Techniques to find critical points (regions of strong direction change) are trickier. At the simplest, you might also use corner detectors and then check the distance from one corner to another to identify the place when the inner edge of the thumb meets the fist.
For more complex methods, look into papers about shape decomposition by authors such as Kimia, Siddiqi, and Xiaofing Mi.
MatchTemplate seems like a good fit for the problem you describe. In what way is it failing for you? If you are actually masking the thumbs-up/thumbs-down/thumbs-in-between signs as nicely as you show in your sample image then you have already done the most difficult part.
MatchTemplate does not include rotation and scaling in the search space, so you should generate more templates from your reference image at all rotations you'd like to detect, and you should scale your templates to match the general size of the found thumbs up/thumbs down signs.
[edit]
The result array for MatchTemplate contains an integer value that specifies how well the fit of template in image is at that location. If you use CV_TM_SQDIFF then the lowest value in the result array is the location of best fit, if you use CV_TM_CCORR or CV_TM_CCOEFF then it is the highest value. If your scaled and rotated template images all have the same number of white pixels then you can compare the value of best fit you find for all different template images, and the template image that has the best fit overall is the one you want to select.
There are tons of rotation/scaling independent detection functions that could conceivably help you, but normalizing your problem to work with MatchTemplate is by far the easiest.
For the more advanced stuff, check out SIFT, Haar feature based classifiers, or one of the others available in OpenCV
I think you can get excellent results if you just compute the two points that have the furthest shortest path going through white. The direction in which the thumb is pointing is just the direction of the line that joins the two points.
You can do this easily by sampling points on the white area and using Floyd-Warshall.

Resources