Localization of numbers within a complex scene image - image-processing

First of all, I very much appreciate the help provided by the experts here at SO. The questions posed by many and answered by the experts has been of immense benefit to me. It had helped me with a very crucial problem few months back when I was a student doing my thesis.
Right now I am working on a problem to detect (and then recognize) numbers in a complex scene image. You can check out these images here: http://imageshack.us/g/823/dsc1757w.jpg/. These are pictures of marathon runners with their numbers on the front of their shirts. I have to detect all the numbers that appear in the image and then recognize them. The recognition wont be difficult as these appear to be OCR friendly characters. The crucial thing is how to detect these numbers.
I had an idea to first color filter it for black color. But when I tried in Matlab, the results were not encouraging, as we can see that many of the regions in the image qualify this criteria (the clothes, some shadows behind the runners, the shadows in the foliage, etc). Either I need to classify these characters from these other regions or need some other good technique.
There are papers available and I have gone through some of them, like the SWT, DWT, etc., but I have a feeling they wont be of much help. I was thinking some kind of training algorithm might be useful. There is another reason for this, in future there might be other photos with possibly different fonts, etc., so I think a dedicated algorithmic approach might fail. Can anyone point me in the right direction?
I am not a novice in image processing, but not an expert either. So, any and all help/suggestion in this regard will be greatly appreciated :) .
Thanks,
MD

You know that your problem is not a simple one, but it seems very interesting!
Although I don't have any solutions for you, I will just share my thoughts in hope that you can make something out of it.
Let's take 2 of your photos as examples:
Photo-A: http://imageshack.us/photo/my-images/59/dsc0275a.jpg/
It shows a single person with a relative "big" green label with numbers in his shirt.
Photo-B: http://imageshack.us/photo/my-images/546/dsc0243u.jpg/
It shows a lot of people with red smaller labels in their shirts.
(The labels' height in pixels is about 1/5 of the label in Photo-A)
Considering the above photos, I will try to write some random thoughts which may help...
(a) Define your scale: There is no point to apply a search algorithm to find labels from 2x2 pixels up-to the full image resolution. You must define the minimum/maximum limits for width & height of a label. Those limits may depend on many different factors:
(1) One factor is the real size of labels (defined by the distance of people from camera) which can be defined as a percentage of the image width & height.
(2) Another factor is the actual reading accurracy of the OCR you are going to use. If the numbers' image height is smaller than Y1 pixels or bigger than Y2 pixels the OCR will not be able to read it (it sounds strange but it's true: big images may seem very clear to the human eye, but an OCR may have problems reading it).
(b) Find the area(s) of interest: In your case, this is equivalent to "Find the approximate position of labels". We can define an athlete label roughly as "An (almost) rectangular area, which may be a bit inclined relative to photo borders, and contains: A central area of black + color C1 [e.g. red or green] + a white (=neutral) area on top and/or bottom of it".
A possible algorithm to find the approximate position of a label is:
(1) Traverse all image left-to-right, top-to-bottom and examine a square area of MinHeight/2 x MinHeight/2
(2) Create the histogram of the square area (or posterize it e.g. to 8 levels) and try to find if there is only Black + Another color C1 in a percentage of e.g. Black: 40% +/- 10, Color: 60% +/- 10%
(3) If (2) is true try to expand the area to Right and Bottom while the percentages are kept in the specified limits
(4) If the square is fully expanded, check if the expanded area size is inside the min/max limits of width/height you specified in (a). If not, go to step 1
(5) Process the expanded area to read the numbers - see (c) bellow
(6) Goto to step 1
(c) Process the area(s) of interest: Try the following steps:
(1) Convert each image-area to Grayscale by applying a color filter that burn Color C1 to white.
(2) Equalize the Grayscale to make the black letters stand-out
(3) If an inclination has been detected, perform a reverse rotation on the image-area to make the letters as horizontal as possible.
(4) Feed the area to an OCR trained only for numbers
Good luck with your project!

You could try to contact the author of this software:
Yaroslav is an active member of StackOverflow.

Related

Detecting common colors for groups of objects

I have an image with a collection of objects in K given perceived colors. Providing I extract those objects, how could I cluster them by their perceived color?
Let me give you an example. I am trying to cluster two football teams - so there will be two teams, referees and a keeper (or two, but that`s a rare situation) on the image - 3, 4 or 5 clusters.
For a human's eye, it`s an easy situation. On the picture above, we have white players, red players and a black ref. But it turns out not so easy for automatic processing.
What I have tried so far:
1) I've started working on the BGR colorspace, then tried HSV and now I am exploring CIE Luv, as I read it has unified distances describing the perceived differences between colors.
2) [BGR and HSV] taking the most common color from the contour (not the bounding box). this didn' work at all because of the noise (green field getting in the way), the quality of the image, the position of the player, etc. Colors were pretty much random.
3) [CIE Luv] Resizing all players' boxes to a common size and taking a small portion of the image from the middle (as marked by a black rectangle in the example below).
Taking the mean value of all pixels in each player's window and adding to the list (so, it`s one pixel with the mean value per player). Using K-means (with a defined number of clusters) to find out clusters on that list. This has proven somewhat successful, for the image above I have redish, white and blackish centres in the clusters.
Unfortunately, the assignment of players back to these clusters is pretty much random. I am doing that by calculating the mean color for each player like I described above and then measuring the distance to each cluster. A player might be assigned to the white cluster on one frame and to the red one on the next. Part of the problem might be that the window in the middle of the player's box will sometimes catch a number, grass or shorts, instead of the jersey.
I have already spent a considerable amount of time on trying to figure that out, grateful for any help.
I may be overcomplicating the problem since you just have 3 classes, but try training an SVM classifier based on HOG descriptors. maybe try LDA to improve speed
Some references -
1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.627.6465&rep=rep1&type=pdf - skip to recognition part
2] https://rodrigob.github.io/documents/2013_ijcnn_traffic_signs.pdf - skip to recognition part.
3] https://www.learnopencv.com/handwritten-digits-classification-an-opencv-c-python-tutorial/ - if you want to jump into the code right away
This will always work as long as your detection is good. and can also help to identify different players based on their shirt number
(maybe more???) if you train it right
EDIT: Okkay I have another idea, based on colour segmentation since that was your original approach and require less work (maybe not? color segmentation is a pain! also LIGHTING! LIGHTING! LIGHTING!).
Create a green mask and create a threshold so you detect as little grass as possible when doing your kmeans. Then instead of finding mean, try median instead, that will get you closer to red, coz white is detected as 0 and mean just drops drastically, median doesnt. So it'll be way more robust and you should be able to sort players better (hair color and skin color shouldnt affect it too much)
EDIT 2: Just noticed, if you use the black rectangle you'll get shirt number more (which is white), gonna mess up your classifier, use original box with green masked out
EDIT 3: Also. You can just create 3 thresholds for your required colors and split them up! don't really need Kmeans in this actually. Basically you just need your detected boxes to give out a value inside that threshold. Try the median method I mentioned above. Should improve. Also, might need some more minor tweaks here and there (blur, morphology etc to improve detection)

Recognition and counting of books from side using OpenCV

Just wish to receive some ideas on I can solve this problem.
For a clearer picture, here are examples of some of the image that we are looking at:
I have tried looking into thresholding it, like otsu, blobbing it, etc. However, I am still unable to segment out the books and count them properly. Hardcover is easy of course, as the cover clearly separates the books, but when it comes to softcover, I have not been able to successfully count the number of books.
Does anybody have any suggestions on what I can do? Any help will be greatly appreciated. Thanks.
I ran a sobel edge detector and used Hough transform to detect lines on the last image and it seemed to be working okay for me. You can then link the edges on the output of the sobel edge detector and then count the number of horizontal lines. Or, you can do the same on the output of the lines detected using Hough.
You can further narrow down the area of interest by converting the image into a binary image. The outputs of all of these operators can be seen in following figure ( I couldn't upload an image so had to host it here) http://www.pictureshoster.com/files/v34h8hvvv1no4x13ng6c.jpg
Refer to http://www.mathworks.com/help/images/analyzing-images.html#f11-12512 for some more useful examples on how to do edge, line and corner detection.
Hope this helps.
I think that #audiohead's recommendation is good but you should be careful when applying the Hough transform for images that will have the library's stamp as it might confuse it with another book (You can see that the letters form some break-lines that will be detected by sobel).
Consider to apply first an edge preserving smoothing algorithm such as a Bilateral Filter. When tuned correctly (setting of the Kernels) it can avoid these such of problems.
A Different Solution That Might Work (But can be slow)
Here is a different approach that is based on pixel marking strategy.
a) Based on some very dark threshold, mark all black pixels as visited.
b) While there are unvisited pixels: Pick the next unvisited pixel and apply a region-growing algorithm http://en.wikipedia.org/wiki/Region_growing while marking its pixels with a unique number. At this stage you will need to analyse the geometric shape that this region is forming. A good criteria to detecting a book is that the region is creating some form of a rectangle where width >> height. This will detect a book and mark all its pixels to the unique number.
Once there are no more unvisited pixels, the number of unique numbers is the number of books you will have + For each pixel on your image you will now to which book does it belongs.
Do you have to keep the books this way? If you can change the books to face back side to the camera then I think you can get more information about the different colors used by different books.The lines by Hough transform or edge detection will be more prominent this way.
There exist more sophisticated methods which are much better in contour detection and segmentation, you can have a look at them here, however it is quite slow, http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html
Once you get the ultrametric contour map, you can perform some computation on them to count the number of books
I would try a completely different approach; with paperbacks, the covers are medium-dark lines whilst the rest of the (assuming white pages) are fairly white and "bloomed", so I'd try to thicken up the dark edges to make them easy to detect, then that would give the edges akin to working with hardbacks which you say you've done.
I'd try something like an erosion to thicken up the edges. This would be a nice, fast operation.

which algorithm to choose for object detection?

I am interested in detecting single object more precisely a fire extinguisher which has no inter class variability (all fire extinguisher looks same). However, The application is supposedly realtime i.e a robot is exploring the environment and whenever it sees the object of interest it should be able to detect it and give pixel coordinates of it.
My question is which algorithm will be good choice for this task?
1. Is this a classification problem and should we use features(sift/surf etc) + bow +svm?
2. some other solution (no idea yet).
Any kind of input will be appreciated.
Thanks.
(P.S bear with me i am newbie to computer vision and stack over flow)
update1:
Height varies all are mounted on the wall but with different height. I tried with SIFT features and bow but it is expensive to extract bow descriptors in testing part. Moreover I have no idea how to locate the object(pixel coordinates) inside the image after its been classified positive.
update 2:
I finally used sift + bow + svm and am able to classify the object. But using this technique, i only get output interms of whether the object is present in the scene or not?
How can i detect the object i.e getting the bounding box or centre of the object. what is the compatible approach with the above method for achieving these results.
Thank you all.
I would suggest using color as the main feature to look for, and only try other features as needed. The fire extinguisher red is very distinctive, and should not occur too often elsewhere in an office environment. Other, more computationally expensive tests can then be performed only in regions of the right color.
Here is a good tutorial for color detection that also explains how to find good thresholds for your desired color.
I would suggest the following approach:
denoise your image with a median filter
convert the image to HSV format (Hue, Saturation, Value)
select pixels close to that particular shade of red with InRange()
Now you have a binary image image that contains only the pixels that are red.
count the number of red pixels with CountNonZero()
If that number is too small, abort
remove noise from the binary image by morphological opening / closing
find contours of all blobs in your picture with findContours or the CvBlob library
check if there are blobs of the correct width, correct height and correct width/height ratio
since your fire extinguishers are vertical cylinders, the width/height ratio will be constant from every angle. The width and height will of course vary somewhat with distance to the camera.
if the width and height do not match, abort
repeat these steps to find the black-colored part on the bottom of the extinguisher,
abort if there is no black region with correct width/height below the red region
(perhaps also repeat these steps for the metallic top and the yellow rectangle)
These tests should all be very fast. If they are too slow, you could reduce the resolution of your input images.
Depending on your environment, it is possible that this is already a robust enough test. If not, you can proceed with sift/surf feature matching, but only in a small region around the blobs with the correct color. You also do not necessarily have to do that for each frame, each n-th frame should be be enough for confirmation.
This is a old question .. but will still like to give my recommendation to use YOLO algorithm to solve this problem.
YOLO fits very well to this scenario.

OpenCV: comparing simple images with small difference

I have a bunch of "simple" images and I want to compare if they are similar together. I compare them to each other using template matching (cv::matchTemplate) and results are quite good.
Now I want to fine tune my program and I face a problem. For example I have two images which look very much alike. Only differences they have is that another one has thicker line and the digit front of item is different. When both images are small, one pixell difference in line thickness makes big result differences when doing template matching. When line thicknesses are same and only difference is the front digit, I get template matching result something like 0.98 with CV_TM_CCORR_NORMED when match successful. When line thickness is different matching result is something like 0.95.
I cannot decrease my threshold value below 0.98 because some other similar images have same line thickness.
Here are example images:
So what options do I have?
I have tried:
dilate the original and template
erode also both
morphologyEx both
calculating keypoints and comparing them
finding corners
But no big success yet. Are those images too simple that detecting "good features" is hard?
Any help is very wellcome.
Thank you!
EDIT:
Here are some other example images. What my program consider as similar are put in same zip-folder.
ZIP
A possible way might be thinning the two images, so that every line is of one pixel width, since the differing thickness is causing you the main problem with similarity.
The procedure would be to first binarize/threshold the images, then apply a thinning operation on both images, so both are now having the same thickness of 1 px. Then use the usual template matching that you used before with good results.
In case you'd like more details on the thinning/skeletonization of binary images here are a few OpenCV implementations posted on various discussion forums and OpenCV groups:
OpenCV code for thinning (Guo and Hall algo, works with CvMat inputs)
The JR Parker implementation using OpenCV
Possibly more efficient code here (uses OpenCV optimized access methods a lot, however most of the page is in Japanese!)
And lastly a brief overview of thinning in case you're interested.
You need something more elementary here, there isn't much reason to go for fancy methods. Your figures are already binary ones, and their shapes are very similar overall.
One initial idea: consider the upper points and bottom points in a certain image and form a upper hull and a bottom hull (simply a hull, not a convex hull or anything else). A point is said to be an upper point (respec. bottom point) if, given a column i, it is the first point starting at the top (bottom) of the image that is not a background point in i. Also, your image is mostly one single connected component (in some cases there are vertical bars separated, but that is fine), so you can discard small components easily. This step is important for your situation because I saw there are some figures with some form of noise that is irrelevant to the rest of the image. Considering that a connected component with less than 100 points is small, these are the hulls you get for the respective images included in the question:
The blue line is indicating the upper hull, the green line the bottom hull. If it is not apparent, when we consider the regional maxima and regional minima of these hulls we obtain the same amount in both of them. Furthermore, they are all very close except for some displacement in the y axis. If we consider the mean x position of the extrema and plot the lines of both images together we get the following figure. In this case, the lines in blue and green are for the second image, and the lines in red and cyan for the first. Red dots are in the mean x coordinate of some regional minima, and blue dots the same but for regional maxima (these are our points of interest). (The following image has been resized for better visualization)
As you can see, you get many nearly overlapping points without doing anything. If we do even less, i.e. not even care about this overlapping, and proceed to classify your images in the trivial way: if an image a and another image b have the same amount of regional maxima in the upper hull, the same amount of regional minima in the upper hull, the same amount of regional maxima in the bottom hull, and the same amount of regional minima in the bottom hull, then a and b belong to the same class. Doing this for all your images, all images are correctly grouped except for the following situation:
In this case we have only 3 maxima and 3 minima for the upper hull in the first image, while there are 4 maxima and 4 minima for the second. Following you see the plots for the hulls and points of interest obtained:
As you can notice, in the second upper hull there are two extrema very close. Smoothing this curve eliminates both extrema, making the images match by the trivial method. Also, note that if you draw a rectangle around your images, then this method will tell they are all equal. In that case you will want to compare multiple hulls, discarding the points in the current hull and constructing other ones. Nevertheless, this method is able to group all your images correctly given they are all very simple and mostly noisy-free.
From as much as I can get, the difficulty is when the shape is the same, just size is different. A simple hack approach could be:
- subtract the images, then erode. If the shapes were the same but one slightly bigger, subtracting will leave only the edges, which will be thin an vanish with erosion as noise.
Somewhat more formal, would be to take the contours and then the approximate polygons and do a invariants comparison (Hu Moments etc.)

Image Processing for recognizing 2D features

I've created an iPhone app that can scan an image of a page of graph paper and can then tell me which squares have been blacked out and which squares are blank.
I do this by scanning from left to right and use the graph paper's lines as guides. When I encounter a graph paper line, I start to look for black, until I hit the graph paper line again. Then, instead of continuing along the scan line, I go ahead and completely scan the square for black. Then I continue on to the next box. At the end of the line, I skip down so many pixels before starting the scan on a new line (since I have already figured out how tall each box is).
This sort of works, but there are problems. Sometimes I mistake the graph lines as "black". Sometimes, if the image is skewed, or I don't have uniform lighting across the page, then I don't get good results.
What I'd like to do is to specify a few "alignment" boxes that I then resize and rotate (and skew) the picture to align with those. Then, I was thinking that once I have the image aligned, I would then know where all the boxes are and won't have to scan for the boxes, just scan inside the location of the boxes to see if they are black. This should be faster and more reliable. And if I were to operate on images coming from the camera, I'd have more flexibility in asking the user to align the picture to match the alignment marks, rather than having to align the image myself.
Given that this is my first Image Processing project, I feel like I am reinventing the wheel. I'd like suggestions on how to do this, and whether to utilize libraries like OpenCV.
I am enclosing an image similar to what I would like processed. I am looking for a list of all squares that have a significant amount of black marking, i.e. A8, C4, E7, G4, H1, J9.
Issues to be aware of:
Light coverage of the image may not be ideal, but should be relatively consistent across the image (i.e. no shadows)
All squares may be empty or all dark, and the algorithm needs to be able to determine that
the image may be skewed or rotated about any of the axis. Rotation about the z axis maybe easy to fix. There may be rotation around the x or y axis making ones side of the image be wider than the other. However, if I scan the image in realtime as it comes from the camera, I can ask the user to align the alignment marks with marks on the screen. How best to ensure that alignment to give the user appropriate feedback? Just checking to make sure that the 4 corners are dark could result in a false positive when the camera is pointing to a black surface.
not every square will be equally or consistently blacked, but I think there will be enough black to make it unquestionable to a human eye.
the blue grid may be useful, but there are cases where the black markings may overlap the blue grid. I think a virtual grid is probably better than relying on the printed grid. I would think that using the alignment markers to align the image, would then allow for a precise virtual grid to be laid out. And then the contents of each grid box could be sampled, to see if it was predominantly black, vs scanning from left-to-right, no? Here is another image with more markings on the grid. In this image, in addition to the previous marking in A8, C4, E7, G4, H1, J9, I have marked E2, G8 and G9, and I4 and J4 and you can see how the blue grid is obscured.
This is my first phase of this project. Eventually I'd like to scale this algorithm to be able to process at least a few hundred slots and possibly different colors.
To start with, this problem reminded me a bit of these demo's that might be useful to learn from:
The DNA microarray image processing
The Matlab Sudoku solver
The Iphone Sudoku solver blog post, explaining the image processing
Personally, I think the most simple approach would be to detect the squares in your image.
1) Remove the background and small cruft
f_makebw = #(I) im2bw(I.data, double(median(I.data(:)))/1.3);
bw = ~blockproc(im, [128 128], f_makebw);
bw = bwareaopen(bw, 30);
2) Remove everything but the squares and circles.
se = strel('disk', 5);
bw = imerode(bw, se);
% Detect the squares and cricles via morphology
[B, L] = bwboundaries(bw, 'noholes');
3) Detect the squares using 'extend' from regionprops. The 'Extent' metric measures what proportion of the bounding-box is filled. This makes it a
nice measure to distinguish between circles and squares
stats = regionprops(L, 'Extent');
extent = [stats.Extent];
idx1 = find(extent > 0.8);
bw = ismember(L, idx1);
4) This leaves you with your features, to synchronize or rectify the image with. An easy, and robust way, to do this, is via the Autocorrelation Function.
This gives nice peaks, which are easily detected. These peaks can be matched against the ACF peaks from a template image via the Hungarian algorithm. Once matched, you can correct rotation and scaling as you now have a linear system which you can solve:
x = Ax'
Translation can then be corrected using run-of-the-mill cross correlation against the same pre defined template.
If all goes well, you know have an aligned or synchronized image, which should help considerably in determining the position of the dots.
I've been starting to do something similar using my GPUImage iOS framework, so that might be an alternative to doing all of this in OpenCV or something else. As it's name indicates, GPUImage is entirely GPU-based, so it can have some tremendous performance benefits over CPU-bound processing (up to 180X faster for doing things like processing live video).
As a first stage, I took your images and ran them through a simple luminance thresholding filter with a threshold of 0.5 and arrived at the following for your two images:
I just added an adaptive thresholding filter, which attempts to correct for local illumination variances, and works really well for picking out text. However, in your images it uses too small of an averaging radius to handle your blobs well:
and seems to bring out your grid lines, which it sounds like you wish to ignore.
Maurits provides a more comprehensive description of what you could do, but there might be a way to implement these processing operations as high-performance GPU-based filters instead of relying on slower OpenCV versions of the same calculations. If you could grab rotation and scaling information from this thresholded image, you could construct a transform that could also be applied as a filter to your thresholded image to produce your final aligned image, which could then be downsampled and read out by your application to determine which grid locations were filled in.
These GPU-based thresholding operations run in less than 2 ms for 640x480 frames on an iPhone 4, so it might be possible to chain filters together to analyze incoming video frames as fast as the device's video camera can provide them.

Resources