I'm interested in taking user stroke input (i.e. drawing with an iPad) and classifying it as either text or a drawing (or, I suppose, just non-text), in whatever capacity is reasonably feasible. I'm not expecting a pre-built library for this, I'm just having a hard time finding any papers or algorithmic resources about this.
I don't need to detect what the text is that they're drawing, just whether it's likely text or not.
I would think you will need to generate probabilities for what text character the input is. If the highest probability text character is below some threshold, classify the stroke as drawing.
This is a possible useful paper: http://arxiv.org/pdf/1304.0421v1.pdf (if only for its reference list). Also the first hit on this google scholar search looks relevant: http://scholar.google.com/scholar?q=classification+stroke+input+text+or+drawing
Related
I want to train a neural net to recognize text only a single font on paper (Times Roman). Can I get away with only a single training sample for each character?
My rationale is that the font would not vary as opposed to a hand written one. The only thing that may change is the angle, and brightness which I can clean up before inferring it after my model is trained. Or am I missing something?
It depends on where your input is coming from. If the input is going to be screenshots and the font is always the same, (including font-size, boldness, etc.) and always in the same colours. Then you can probably get away with only one set.
If you are dealing with input from scanners or photos then you probably will end up with partially cut-off characters, warped characters from bends in the page or off-angle 3d photography, smudges on the page, and millions of other minor differences. You can try to clean them up before sending it to OCR, but your cleaner function will need to be more advanced than your OCR function for that to work so it's probably easier to just use a bunch of different learning sets for your OCR neural net.
i am looking for a technique or a known method to search a part of similar segments in a handwritten text.
its a kind of image retrieval, but rather than searching for an entire word or character, i want to search similar parts of strokes given a pattern as input image.
the figure below illustrate this process, where the red segments are input images and red rectangles represent part of text similar to the input.
by "similar", i mean "approximately", not exact matching
thanks in advance
I believe Shape Context can be helpful for your task. You can read more about it here.
Shape context descriptor allows you to robustly describe a local shape. Finding two shape-context descriptors that are nearly identical strongly suggests that the underlying text pattern are quite similar.
You can find Matlab implementation in the project's home page.
I'm trying to detect objects and text in a hand-drawn diagram.
My goal is to be able to "parse" something like this into an object structure for further processing.
My first aim is to detect text, lines and boxes (arrows etc... are not important (for now ;))
I can do Dilatation, Erosion, Otsu thresholding, Invert etc and easily get to something like this
What I need some guidance for are the next steps.
I've have several ideas:
Contour Analysis
OCR using UNIPEN
Edge detection
Contour Analysis
I've been reading about "Contour Analysis for Image Recognition in C#" on CodeProject which could be a great way to recognize boxes etc. but my issue is that the boxes are connected and therefore do not form separate objects to match with a template.
Therefore I need some advises IF this is a feasible way to go.
OCR using UNIPEN
I would like to use UNIPEN (see "Large pattern recognition system using multi neural networks" on CodeProject) to recognize handwritten letters and then "remove" them from the image leaving only the boxes and lines.
Edge detection
Another way could be to detect all lines and corners and in that way infer the boxes and lines that the image consist of. In that case ideas on how to straighten the lines and find the 90 degree corners would be helpful.
Generally, I think I just need some pointers on which strategy to apply, not code samples (though it would be great ;))
I will try to answer about the contour analysis and the lines between them.
If you need to turn the interconnected boxes into separate objects, that can be achieved easily enough:
close the gaps in the box edges with morphological closing
perform connected components labeling and look for compact objects (e.g. objects whose area is close to the area of their bounding box)
You will get the insides of the boxes. These can be elliptical or rectangular or any shape you may find in common diagrams, the contour analysis can tell you which. A problem may arise for enclosed background areas (e.g. the space between the ABC links in your example diagram). You might eliminate these on the criterion that their bounding box overlaps with multiple other objects' bounding boxes.
Now find line segments with HoughLinesP. If a segment finishes or starts within a certain distance of the edge of one of the objects, you can assume it is connected to that object.
As an added touch you could try to detect arrow ends on either side by checking the width profile of the line segments in a neighbourhood of their endpoints.
It is an interesting problem, I will try to remember it and give it to my students to grit their teeth on.
What method is suitable to capture (detect) MRZ from a photo of a document? I'm thinking about cascade classifier (e.g. Viola-Jones), but it seems a bit weird to use it for this problem.
If you know that you will look for text in a passport, why not try to find passport model points on it first. Match template of a passport to it by using ASM/AAM (Active shape model, Active Appearance Model) techniques. Once you have passport position information you can cut out the regions that you are interested in. This will take some time to implement though.
Consider this approach as a great starting point:
Black top-hat followed by a horisontal derivative highlights long rows of characters.
Morphological closing operation(s) merge the nearby characters and character rows together into a single large blob.
Optional erosion operation(s) remove the small blobs.
Otsu thresholding followed by contour detection and filtering away the contours which are apparently too small, too round, or located in the wrong place will get you a small number of possible locations for the MRZ
Finally, compute bounding boxes for the locations you found and see whether you can OCR them successfully.
It may not be the most efficient way to solve the problem, but it is surprisingly robust.
A better approach would be the use of projection profile methods. A projection profile method is based on the following idea:
Create an array A with an entry for every row in your b/w input document. Now set A[i] to the number of black pixels in the i-th row of your original image.
(You can also create a vertical projection profile by considering columns in the original image instead of rows.)
Now the array A is the projected row/column histogram of your document and the problem of detecting MRZs can be approached by examining the valleys in the A histogram.
This problem, however, is not completely solved, so there are many variations and improvements. Here's some additional documentation:
Projection profiles in Google Scholar: http://scholar.google.com/scholar?q=projection+profile+method
Tesseract-ocr, a great open source OCR library: https://code.google.com/p/tesseract-ocr/
Viola & Jones' Haar-like features generate many (many (many)) features to try to describe an object and are a bit more robust to scale and the like. Their approach was a unique approach to a difficult problem.
Here, however, you have plenty of constraint on the problem and anything like that seems a bit overkill. Rather than 'optimizing early', I'd say evaluate the standard OCR tools off the shelf and see where they get you. I believe you'll be pleasantly surprised.
PS:
You'll want to preprocess the image to isolate the characters on a white background. This can be done quite easily and will help the OCR algorithms significantly.
You might want to consider using stroke width transform.
You can follow these tips to implement it.
Algorithm for a drawing and painting robot -
Hello
I want to write a piece of software which analyses an image, and then produces an image which captures what a human eye perceives in the original image, using a minimum of bezier path objects of varying of colour and opacity.
Unlike the recent twitter super compression contest (see: stackoverflow.com/questions/891643/twitter-image-encoding-challenge), my goal is not to create a replica which is faithful to the image, but instead to replicate the human experience of looking at the image.
As an example, if the original image shows a red balloon in the top left corner, and the reproduction has something that looks like a red balloon in the top left corner then I will have achieved my goal, even if the balloon in the reproduction is not quite in the same position and not quite the same size or colour.
When I say "as perceived by a human", I mean this in a very limited sense. i am not attempting to analyse the meaning of an image, I don't need to know what an image is of, i am only interested in the key visual features a human eye would notice, to the extent that this can be automated by an algorithm which has no capacity to conceptualise what it is actually observing.
Why this unusual criteria of human perception over photographic accuracy?
This software would be used to drive a drawing and painting robot, which will be collaborating with a human artist (see: video.google.com/videosearch?q=mr%20squiggle).
Rather than treating marks made by the human which are not photographically perfect as necessarily being mistakes, The algorithm should seek to incorporate what is already on the canvas into the final image.
So relative brightness, hue, saturation, size and position are much more important than being photographically identical to the original. The maintaining the topology of the features, block of colour, gradients, convex and concave curve will be more important the exact size shape and colour of those features
Still with me?
My problem is that I suffering a little from the "when you have a hammer everything looks like a nail" syndrome. To me it seems the way to do this is using a genetic algorithm with something like the comparison of wavelet transforms (see: grail.cs.washington.edu/projects/query/) used by retrievr (see: labs.systemone.at/retrievr/) to select fit solutions.
But the main reason I see this as the answer, is that these are these are the techniques I know, there are probably much more elegant solutions using techniques I don't now anything about.
It would be especially interesting to take into account the ways the human vision system analyses an image, so perhaps special attention needs to be paid to straight lines, and angles, high contrast borders and large blocks of similar colours.
Do you have any suggestions for things I should read on vision, image algorithms, genetic algorithms or similar projects?
Thank you
Mat
PS. Some of the spelling above may appear wrong to you and your spellcheck. It's just international spelling variations which may differ from the standard in your country: e.g. Australian standard: colour vs American standard: color
There is an model that can implemented as an algorithm to calculate a saliency map for an image, determining which parts of the image would get the most attention from a human.
The model is called itti koch model
You can find a startin paper here
And more resources and c++ sourcecode here
I cannot answer your question directly, but you should really take a look at artist/programmer (Lisp) Harold Cohen's painting machine Aaron.
That's quite a big task. You might be interested in image vectorizing (don't know what it's called officially), which is used to take in rasterized images (such as pictures you take with a camera) and outputs a set of bezier lines (i think) that approximate the image you put in. Since good algorithms often output very high quality (read: complex) line sets you'd also be interested in simplification algorithms which can help enormously.
Unfortunately I am not next to my library, or I could reccomend a number of books on perceptual psychology.
The first thing you must consider is the physiology of the human eye is such that when we examine an image or scene, we are only capturing very small bits at a time, as our eyes dart around rapidly. Our mind peices the different parts together to try and form a whole.
You might start by finding an algorithm for the path of an eyeball as it darts around. Perhaps it is attracted to contrast?
Next is that our eyes adjust the "exposure" depending on the context. It's like those high dynamic range images, if they were peiced together not by multiple exposures of a whole scene, but by many small images, each balanced on its own, but blended into its surroundings to form a high dynamic range.
Now there was a finding in a monkey brain that there is a single neuron that lights up if there's a diagonal line in the upper left of its field of vision. Similar neurons can be found for vertical lines, and horizontal lines in various areas of that monkey's field of vision. The "diagonalness" determines the frequency with which that neuron fires.
one might speculated that other neurons might be found and mapped to other qualities such as redness, or texturedness, and other things.
There's something humans can do that I've not seen a computer program ever able to do. it's something called "closure", where a human is able to fill in information about something that they are seeing, that doesn't actually exist in the image. an example:
*
* *
is that a triangle? If you knew that it was in advance, then you could probably make a program to connect the dots. But what if it's just dots? How can you know? I wouldn't attempt this one unless I had some really clever way of dealing with that one.
There are many other facts about human perception you might be able to use. Good luck, you've not picked a straightforward task.
i think a thing that could help you in this enormous task is human involvement. i mean data. like you could have many people sitting staring at random dots (like from the previous post) and connect them as they see right. you could harness that data.