How to find the brightest rectangle of a certain size in an integral image? - image-processing

Is there anything faster than sliding window? I tried sort of binary search with overlapping rectangles - it kinda works but sometimes cuts off part of the blob (expected, right) - see the video in

Binary search makes no sense, because it is an algorithm for searching for specific values in a sorted structure.
Unless you have some a priori knowledge about the image, you need to check all possible locations, which is the sliding window method you suggested.
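For reference, the exhaustive check is cheap once the integral image exists, because every window sum costs four lookups. The following is a minimal NumPy sketch, not the asker's code: the function name `brightest_rectangle` and the zero-padded integral image layout are assumptions made for illustration.

```python
import numpy as np

def brightest_rectangle(image, height, width):
    """Return (row, col, total) of the height x width window with the largest sum."""
    img = np.asarray(image, dtype=np.float64)
    # Integral image padded with a leading row and column of zeros,
    # so ii[y, x] holds the sum of img[:y, :x].
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    # Sum of every window at once:
    # S(y, x) = ii[y+h, x+w] - ii[y, x+w] - ii[y+h, x] + ii[y, x]
    sums = (ii[height:, width:] - ii[:-height, width:]
            - ii[height:, :-width] + ii[:-height, :-width])
    row, col = np.unravel_index(np.argmax(sums), sums.shape)
    return int(row), int(col), float(sums[row, col])

# Usage: a 2 x 3 bright patch hidden in a dark 10 x 10 image.
demo = np.zeros((10, 10))
demo[3:5, 6:9] = 1.0
print(brightest_rectangle(demo, 2, 3))
```

This still visits every location, as the answer says is necessary; it just does so in vectorised form instead of an explicit loop.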

Chris is correct, unless you can say something about the statistics of the surrounding regions, e.g., "certain arrangements of pixels around the spot I'm looking for are unlikely". Note, this is different from saying "will never happen", and any algorithm based on statistical approaches will have an associated probability of (wrong box found).
If you think the statistics of the larger regions around your desired location might be informative, you might be able to do some block processing on larger blocks before doing the fine-level sliding window. For example, if you can say with high probability that a certain 64 x 64 region doesn't contain the maximum, then you can throw out a lot of 64 x 64 pixel regions (taken with a 32-pixel overlap) using (maybe) only a few features.
You can train something like AdaBoost to do this. See the classic Viola-Jones work, which does this for face detection.
If you absolutely need the location of the maximum, then, like Chris said, you need to search everywhere.


How to count tablets successfully?

My last question on image recognition seemed to be too broad, so I would like to ask a more concrete question.
First the background. I have already developed a (round) pill counter. It uses something similar to this tutorial. After I made it I also found something similar with this other tutorial.
However my method fails for something like this image
Although the segmentation process is a bit complicated (because of the semi-transparency of the tablets) I have managed to get it
My problem is here. How can I count the elongated tablets, separating each one from the image, similar to the final results in the linked tutorials?
So far I have applied distance transform and then my own version of watershed and I got
As you can see it fails in the adjacent tablets (distance transform usually does).
Take into account that the solution does have to work for this image and also for other arrangements of the tablets, the most difficult being for example
I am open to using OpenCV or, if necessary, implementing my own algorithms. So far I have tried both (OpenCV functions and my own libraries). I am also open to using C++, Python, or another language (I programmed them in C++ and I have done it in C# too).
I am also working on this pill-counting problem (I'm much earlier in the process than you are). For the piece you are working on, touching pills, my general idea is to capture the contours of the pills once you have a good mask, and then calculate the area of a single pill.
For this approach I'm assuming that there are enough pills in the image that those not touching outnumber those that are touching, and that no pills overlap one another. For my application I think this restriction is reasonable (humans can take a quick look at the pills they've dumped out and at least roughly separate them without too much work. It's also possible that I could design a tray with some sort of dimples in it that would coerce the pills into not touching).
I do this by sorting the contour areas (which, with the right thresholding should lead to only pills and pill-groups being in the identified contours), and taking the median value.
Then, with a good value for the area of a pill, you can look for contours with areas that are a multiple of that median area (+/- some % error value).
I also use that median value to filter out contours that are clearly not big enough to be pills, and ones that are far too large to be a pill (the latter though could be more troublesome, since it could still be a grouping of touching pills).
Given that the pills are all identical and don’t overlap, simply divide the total pill area by the area of a single pill.
The area is estimated by simply counting the number of “pill” pixels.
You do need to calibrate the method by giving it the area of a single pill. This can be trivially obtained by giving the correct solution to one of the images (manual counting), then all the other images can be counted automatically.
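A minimal sketch of this calibrate-then-divide approach, assuming a binary mask in which pill pixels are non-zero. The function names `calibrate_single_pill_area` and `count_by_total_area` are made up for the example.

```python
import numpy as np

def calibrate_single_pill_area(mask, known_count):
    """Area of one pill, from a mask whose pill count was obtained by hand."""
    return np.count_nonzero(mask) / known_count

def count_by_total_area(mask, single_pill_area):
    """Total number of pill pixels divided by the area of a single pill."""
    return int(round(np.count_nonzero(mask) / single_pill_area))

# Usage: calibrate on an image with 3 manually counted pills (60 px in total),
# then count a second image automatically.
calibration_mask = np.zeros((20, 20), dtype=np.uint8)
calibration_mask[0:5, 0:4] = 1      # one pill, 20 px
calibration_mask[10:15, 0:8] = 1    # two touching pills, 40 px
area = calibrate_single_pill_area(calibration_mask, known_count=3)

other_mask = np.zeros((20, 20), dtype=np.uint8)
other_mask[0:10, 0:10] = 1          # 100 px of pills
print(count_by_total_area(other_mask, area))
```

The result is only as good as the segmentation: every stray or missing pixel in the mask shifts the estimate.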

Image similarity of apartment photos

I want to design an algorithm that would find matches in images of the same apartment, when put up by different real estate agents.
The photos are taken at relatively similar times, so the interior of the rooms should not change that much, but of course every agent takes different pictures from different angles, etc.
(TL;DR: an apartment goes up for sale, different real estate agents come in and take their own pictures, and I want to know whether the pictures from the various agents are of the same place.)
I know that image processing and recognition algorithm selections highly depend on the use case, so could you point me in correct direction given my use-case?
You can actually use Clarifai's Custom Training API endpoint, fairly simple and straightforward. All you would have to do is train the initial image and then compare the second to it. If the probability is high, it is likely the same apartment. For example:
In javascript, to declare a positive it is:
clarifai.positive('', 'firstapartment', callback);
And a negative is:
clarifai.negative('', 'firstapartment', callback);
You don't necessarily have to declare a negative, but it can only help. Then, when you are comparing images to the first apartment, you do:
clarifai.predict('', 'firstapartment', callback);
This will give you a probability regarding the likeness of the photo to what you've trained ('firstapartment'). This API is basically doing machine learning without the hassle of the actual machine. Clarifai's API also has a tagging input that is extremely accurate with some basic tags. The API is free for a certain number of calls/month. Definitely worth it to check out for this case.
As user Shaked mentioned in a comment, this is a difficult problem. Even if you knew the position and orientation of each camera in space, and also the characteristics of each camera, it wouldn't be a trivial problem to match the images.
A "bag of words" (BoW) approach may be of use here. Rather than try to identify specific objects and/or deduce the original 3D scene, you determine what "feature descriptors" can distinguish objects from one another in your image sets.
Imagine you could describe the two images by the relative locations of textures and colors:
horizontal-ish line segments at far left
red blob near center left
green clumpy thing at bottom left
bright round object near top left
then for a reasonably constrained set of images (e.g. photos just within a certain zip code), you may be able to yield a good match between the two images above.
The Wikipedia article on BoW may look a bit daunting, but I think if you hunt around you'll find an article that describes "bag of words" for image processing clearly. I've seen a very good demo of a BoW approach used to identify objects such as boats and delivery vans in arbitrary video streams, and it worked impressively well. I wish I had a copy of the presentation to pass along.
If you don't expect the scene to change much, you could try the first step of any standard structure-from-motion algorithm to establish a notion of similarity between a pair of images. A pair of images is considered similar if the number of matching image features that also satisfy the geometrical constraint of the scene is larger than a threshold. For a general scene, that geometrical constraint is given by a fundamental matrix F computed using a subset of the matching features.
Here are the steps. I have inserted the opencv method for each step, but you could write your methods too:
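Read the pair of images. Use img = cv2.imread(filename).
Use SIFT/SURF to detect image features/descriptors in both images.
sift = cv2.xfeatures2d.SIFT_create()  # cv2.SIFT_create() in OpenCV 4.4 and later
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
Match features using the descriptors. SIFT/SURF descriptors are floating-point, so use the L2 norm (NORM_HAMMING is meant for binary descriptors such as ORB).
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches = bf.match(des1, des2)
Use RANSAC to compute the fundamental matrix. The matched keypoint coordinates have to be collected into arrays first (this needs import numpy as np).
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3, 0.99)
mask marks the inliers (1 for an inlier, 0 for an outlier). Simply count them to determine whether the number of matches satisfying the geometrical constraint is large enough.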
CAUTION: In the case of a planar scene, we use a homography instead of a fundamental matrix, and the steps described above work out pretty nicely because a homography takes a point to the corresponding point in the other image. The fundamental matrix, however, takes a point to the corresponding epipolar line in the other image, which makes the entire process a bit less stable. So I would recommend trying these steps a few more times with a little bit of jitter added to the feature locations, and collating the evidence over more than one trial to make the decision. You can also use more advanced steps to make this process more robust, but only if the steps described above don't yield the results you need.

OCR: segmentation of small text

The problem
I've been building a (very) simple OCR engine.
Since I'm trying to classify very small (pixel size) characters, I'm having some difficulties with segmentation. Here's an example, after best-effort image-wide thresholding:
What I've tried
Error detection:
large horizontal size of the segments. It works, mostly, but fails (false positive) for a few larger characters.
classify, and reject on low score. This seems a bit wasteful.
Error correction:
add pixels vertically (vertical histogram), find minimum. It cuts many segments in the wrong place, in many of the samples.
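For readers unfamiliar with the vertical-histogram correction, here is a minimal sketch of the idea, assuming a binarised segment with non-zero foreground pixels. The function name `split_column` and the `margin` parameter are hypothetical; this is not the asker's implementation.

```python
import numpy as np

def split_column(segment, margin=1):
    """Column index at which to cut a merged segment, or None if it is too narrow."""
    foreground = np.asarray(segment) > 0
    # Vertical histogram: number of foreground pixels in each column.
    profile = foreground.sum(axis=0)
    # Ignore the outermost columns so the cut never lands on the segment edge.
    interior = profile[margin:len(profile) - margin]
    if interior.size == 0:
        return None
    # Cut where the fewest pixels connect the two halves (first such column).
    return margin + int(np.argmin(interior))

# Usage: two 4-pixel-wide blobs joined by a one-pixel bridge in column 4.
merged = np.ones((5, 9), dtype=np.uint8)
merged[:, 4] = 0
merged[2, 4] = 1
print(split_column(merged))
```

As noted above, the thinnest column is not always the true character boundary, which is why this cut fails on many samples.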
What I haven't tried yet
Trying to classify on all possible segmentation points (pixels). This would be very wasteful, and be difficult to expand for a 3-merged-characters segment.
I've been reading up on morphology approaches to turn the characters into mathematical curves, but I don't really know where to start, or if it's worth the effort.
Where to go from here?
I have no idea. Hence this question :)
Lean back and half close your eyes.
63 :-)
Now, if only it was so easy for a computer!
It's tantalisingly close to what double-patterning does (or un-does?) in silicon masks.
I would suggest oversampling (doubling or quadrupling the pixel count in each axis), filtering (probably low pass - or possibly bandpass where the passband = spatial frequency of a line), re-thresholding until they separate. Expensive, so only apply in problem areas.
Reinvent your problem so you do not need segmentation.
Really, at this scale I think you'd better invest in other approaches. For example, if you are doing OCR on text (are you?), you can use the information about lines (character height). There are not many fonts that can be used for small (yet readable) characters. My approach would be an algorithm that scans the text line in scanlines (from left to right, taking the pixels from top to bottom) and tries to find correlations between trained text and scanlines (n, n-1... n-x).
And you probably need the information in the grayscale levels as well, so better not to threshold the images.

EMGU OpenCV disparity only on certain pixels

I'm using the EMGU OpenCV wrapper for C#. I've got a disparity map being created nicely. However, for my specific application I only need the disparity values of very few pixels, and I need them in real time. The calculation is taking about 100 ms now; I imagine that by getting the disparity for hundreds of pixels rather than thousands, things would speed up considerably. I don't know much about what's going on "under the hood" of the stereo solver code. Is there a way to speed things up by only calculating the disparity for the pixels that I need?
First of all, you fail to mention what you are really trying to accomplish and, moreover, which algorithm you are using. E.g. StereoGC is really slow (i.e. not real-time) but usually far more accurate than both StereoSGBM and StereoBM. Those last two can be used in real time, provided a few conditions are met:
The size of the input images is reasonably small;
You are not using an extravagant set of parameters (for instance, a larger value for numberOfDisparities will increase computation time).
Don't expect miracles when it comes to accuracy though.
Apart from that, there is the issue of "just a few pixels". As far as I understand, the algorithms implemented in OpenCV usually rely on information from more than 1 pixel to determine the disparity value. E.g. they need a neighborhood to detect which pixel from image A maps to which pixel in image B. As a result, in general it is not possible to just discard every other pixel of the image (by the way, if you already knew the locations in both images, you would not need the stereo methods at all). So unless you can discard a large border of your input images in which you know you'll never find your pixels of interest, I'd say the answer to this part of your question would be "no".
If you happen to know that your pixels of interest will always be within a certain rectangle of the input images, you can set the input image ROIs (regions of interest) to this rectangle. Assuming OpenCV does not contain a bug here, this should speed up the computation a little.
With a bit of googling you can find real-time examples on YouTube of finding stereo correspondences using EmguCV (or plain OpenCV) on the GPU. Maybe this could help you.
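To illustrate why a neighborhood is needed, here is a minimal hand-rolled sketch of block matching for a handful of pixels, written in Python/NumPy rather than EMGU. It is not a reimplementation of StereoBM or StereoSGBM and involves no OpenCV call. It assumes rectified grayscale images (so a match lies on the same row), and the function name `sparse_disparity` and its parameters are made up for the example.

```python
import numpy as np

def sparse_disparity(left, right, points, max_disparity=16, half_window=3):
    """Disparity at selected (row, col) pixels of the left image via SAD block matching."""
    left = np.asarray(left, dtype=np.float32)
    right = np.asarray(right, dtype=np.float32)
    h, w = left.shape
    hw = half_window
    result = {}
    for (y, x) in points:
        # The window around the pixel must fit inside the image.
        if y - hw < 0 or y + hw >= h or x - hw < 0 or x + hw >= w:
            result[(y, x)] = None
            continue
        patch = left[y - hw:y + hw + 1, x - hw:x + hw + 1]
        best_d, best_cost = None, np.inf
        for d in range(max_disparity + 1):
            xr = x - d  # candidate column in the right image
            if xr - hw < 0:
                break
            candidate = right[y - hw:y + hw + 1, xr - hw:xr + hw + 1]
            cost = np.abs(patch - candidate).sum()  # sum of absolute differences
            if cost < best_cost:
                best_d, best_cost = d, cost
        result[(y, x)] = best_d
    return result

# Usage: a synthetic pair in which everything is shifted by 5 pixels.
rng = np.random.default_rng(0)
left_img = rng.integers(0, 256, size=(40, 60)).astype(np.uint8)
right_img = np.roll(left_img, -5, axis=1)
print(sparse_disparity(left_img, right_img, [(20, 30), (10, 40)]))
```

Each pixel of interest still costs a window comparison for every candidate disparity, and without the filtering and consistency checks of the library solvers the result will be noisier on real images.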
Disclaimer: this may have been a more complete answer if your question contained more detail.

An algorithm for a drawing and painting robot - any tips?

I want to write a piece of software which analyses an image, and then produces an image which captures what a human eye perceives in the original image, using a minimum of bezier path objects of varying colour and opacity.
Unlike the recent twitter super compression contest (see:, my goal is not to create a replica which is faithful to the image, but instead to replicate the human experience of looking at the image.
As an example, if the original image shows a red balloon in the top left corner, and the reproduction has something that looks like a red balloon in the top left corner then I will have achieved my goal, even if the balloon in the reproduction is not quite in the same position and not quite the same size or colour.
When I say "as perceived by a human", I mean this in a very limited sense. I am not attempting to analyse the meaning of an image, and I don't need to know what an image is of. I am only interested in the key visual features a human eye would notice, to the extent that this can be automated by an algorithm which has no capacity to conceptualise what it is actually observing.
Why this unusual criterion of human perception over photographic accuracy?
This software would be used to drive a drawing and painting robot, which will be collaborating with a human artist (see:
Rather than treating marks made by the human which are not photographically perfect as necessarily being mistakes, the algorithm should seek to incorporate what is already on the canvas into the final image.
So relative brightness, hue, saturation, size and position are much more important than being photographically identical to the original. Maintaining the topology of the features (blocks of colour, gradients, convex and concave curves) will be more important than the exact size, shape and colour of those features.
Still with me?
My problem is that I am suffering a little from the "when you have a hammer everything looks like a nail" syndrome. To me it seems the way to do this is using a genetic algorithm with something like the comparison of wavelet transforms (see: used by retrievr (see: to select fit solutions.
But the main reason I see this as the answer is that these are the techniques I know; there are probably much more elegant solutions using techniques I don't know anything about.
It would be especially interesting to take into account the ways the human vision system analyses an image, so perhaps special attention needs to be paid to straight lines and angles, high-contrast borders and large blocks of similar colours.
Do you have any suggestions for things I should read on vision, image algorithms, genetic algorithms or similar projects?
Thank you
PS. Some of the spelling above may appear wrong to you and your spellcheck. It's just international spelling variations which may differ from the standard in your country: e.g. Australian standard: colour vs American standard: color
There is a model that can be implemented as an algorithm to calculate a saliency map for an image, determining which parts of the image would get the most attention from a human.
The model is called the Itti-Koch model.
You can find a starting paper here
And more resources and C++ source code here
I cannot answer your question directly, but you should really take a look at artist/programmer (Lisp) Harold Cohen's painting machine Aaron.
That's quite a big task. You might be interested in image vectorizing (I don't know what it's called officially), which takes in rasterized images (such as pictures you take with a camera) and outputs a set of bezier lines (I think) that approximate the image you put in. Since good algorithms often output very high quality (read: complex) line sets, you'd also be interested in simplification algorithms, which can help enormously.
Unfortunately I am not next to my library, or I could recommend a number of books on perceptual psychology.
The first thing you must consider is that the physiology of the human eye is such that when we examine an image or scene, we are only capturing very small bits at a time, as our eyes dart around rapidly. Our mind pieces the different parts together to try and form a whole.
You might start by finding an algorithm for the path of an eyeball as it darts around. Perhaps it is attracted to contrast?
Next is that our eyes adjust the "exposure" depending on the context. It's like those high dynamic range images, if they were pieced together not from multiple exposures of a whole scene, but from many small images, each balanced on its own, but blended into its surroundings to form a high dynamic range.
Now there was a finding in a monkey brain that there is a single neuron that lights up if there's a diagonal line in the upper left of its field of vision. Similar neurons can be found for vertical lines, and horizontal lines in various areas of that monkey's field of vision. The "diagonalness" determines the frequency with which that neuron fires.
One might speculate that other neurons could be found and mapped to other qualities such as redness, or texturedness, and other things.
There's something humans can do that I've never seen a computer program able to do. It's something called "closure", where a human is able to fill in information about something that they are seeing that doesn't actually exist in the image. An example:
* *
is that a triangle? If you knew that it was in advance, then you could probably make a program to connect the dots. But what if it's just dots? How can you know? I wouldn't attempt this one unless I had some really clever way of dealing with that one.
There are many other facts about human perception you might be able to use. Good luck, you've not picked a straightforward task.
I think a thing that could help you in this enormous task is human involvement. I mean data. For example, you could have many people sit and stare at random dots (like the ones in the previous post) and connect them as they see fit. You could harness that data.
