I am trying to scan a business card using Tesseract OCR. All I am doing is sending the image in with no preprocessing. Here's the code I am using:
Tesseract* tesseract = [[Tesseract alloc] initWithLanguage:@"eng+ita"];
tesseract.delegate = self;
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@.-()" forKey:@"tessedit_char_whitelist"];
[tesseract setImage:[UIImage imageNamed:@"card.jpg"]]; //image to check
[tesseract recognize];
NSLog(@"Here is the text %@", [tesseract recognizedText]);
Picture of card
This is the output
As you can see, the accuracy is not 100%, but that is not what I am concerned about; I figure I can fix that with some simple pre-processing. However, notice that it mixes the two text blocks at the bottom, which splits up the address, and possibly other information on other cards.
How can I use Leptonica (or something else, maybe OpenCV) to group the text somehow? Possibly by sending regions of text in the image individually to Tesseract to scan?
I've been stuck on this problem for a while, so any possible solutions are welcome!
I would recommend using an algorithm called the "Run Length Smoothing Algorithm" (RLSA). This algorithm is used in a lot of document image processing systems, though not every system exposes it as part of its API.
The original paper was published in 1982 and requires payment. However, the same algorithm is cited by many other papers on document image processing, where you can easily find implementation details and improvements.
One such paper is this: http://www.sciencedirect.com/science/article/pii/S0262885609002005
The basic idea is to scan the document image row by row, recording the width of the gaps between letters.
Then, nearby text characters can be combined by filtering on the width of the gaps, and setting small gaps to the same color as the text. The result will be large connected components that represent:
Words, by closing the gaps between characters,
Text lines, by closing the gaps between words, and
Paragraphs, by scanning column by column and then closing the vertical gaps between text lines.
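To make the idea concrete, here is a minimal pure-Python sketch of the smoothing step described above. It is a simplified variant that smooths rows first and columns second, not the exact procedure from the 1982 paper. The function names and the gap thresholds are my own assumptions, and the image is assumed to be a list of rows where 1 means text and 0 means background.

```python
def smooth_runs(line, max_gap):
    """Fill background runs of at most max_gap pixels that lie between two text pixels."""
    out = list(line)
    last_text = None  # index of the most recent text pixel seen so far
    for i, pixel in enumerate(line):
        if pixel:
            gap = i - last_text - 1 if last_text is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_text + 1, i):
                    out[j] = 1  # set the small gap to the text colour
            last_text = i
    return out


def rlsa(image, h_gap, v_gap):
    """Smooth every row with h_gap, then every column of the result with v_gap."""
    rows = [smooth_runs(row, h_gap) for row in image]
    cols = [smooth_runs(col, v_gap) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to rows
```

A small h_gap merges characters into words, a larger one merges words into lines, and v_gap controls how readily lines merge into paragraphs. The right values depend on the font size and resolution of the scan.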
If you do not have access to any document image analysis library that exposes this functionality, you can mimic the effect by:
Applying morphological operations (morphological closing), and then
Performing connected-component labeling on the result.
Most image processing libraries, such as OpenCV, provide this functionality. It might be less efficient to take this approach, because you will have to re-run the algorithm with different text gap sizes to achieve the different levels of clustering, unless the user provides your application with the text gap sizes.
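As a rough illustration of the two steps above, here is a pure-Python sketch, assuming a binary image stored as a list of rows (1 = text). In practice you would use your image library's closing and labeling routines instead. The function names, the horizontal-only structuring element and the choice of 4-connectivity are my own assumptions.

```python
from collections import deque


def dilate_h(image, r):
    """A pixel becomes text if any pixel within r columns in the same row is text."""
    w = len(image[0])
    return [[int(any(row[max(0, x - r):x + r + 1])) for x in range(w)] for row in image]


def erode_h(image, r):
    """A pixel stays text only if all in-image pixels within r columns are text."""
    w = len(image[0])
    return [[int(all(row[max(0, x - r):x + r + 1])) for x in range(w)] for row in image]


def close_h(image, r):
    """Morphological closing (dilation then erosion) with a horizontal window of 2*r + 1."""
    return erode_h(dilate_h(image, r), r)


def bounding_boxes(image):
    """Label 4-connected components; return one (left, top, right, bottom) box each."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and image[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Each resulting box is a candidate text region that could be cropped and sent to the OCR engine on its own.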
I think you've hit on a fundamental problem with OCR - printed designs of this type use white space as meaningful delimiters, but the OCR software doesn't/can't understand that.
This is just a wild stab in the dark, but here's what I would try:
Starting in the upper left, build a box perhaps 1-5% of the size of the whole image. Send that to OCR, and see if you get what looks meaningful back. If not, expand until you get something.
Once you have something, start expanding the block in reasonable units until you stop getting new data back. You can, hopefully, decide this point is "meaningful white space", and now you can consider this processed text as "one block" and thus complete. Now start with whatever the next unprocessed part of the image is, and thus work your way through until you've got the whole image complete.
By working with a set of interlinking expanding boxes, the hope is you'll only get meaningful blocks of data grouped together. Working with your example, once you isolate the logo and process it (and the resulting gibberish), the next box will start with, say, the "N" in Noah. Then you expand out to the right until you've gotten the whole name.
With this done you go again and, hopefully, you'll get a bounding box that includes the "A" in Associate, and get that whole line.
A pixel at a time this would take too long with all those runs to the OCR, I'm sure, but there will surely be a trade-off in "size of chunks to expand per interval" and "amount of processing required".
I don't see why this approach wouldn't work for relatively normal print designs, like a regular style business card.
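For what it's worth, here is a very rough Python sketch of the "expand until nothing new comes back" loop. It only grows the box to the right. The `ocr` argument is a hypothetical callable that takes a (left, top, right, bottom) box and returns the recognized text; it stands in for whatever OCR call you actually use, and the step size is an assumption you would have to tune.

```python
def grow_block(ocr, start, step, limit):
    """Widen a box to the right until the OCR result stops changing.

    ocr   -- hypothetical callable: (left, top, right, bottom) -> recognized text
    start -- initial box as (left, top, right, bottom)
    step  -- number of pixels to add per iteration
    limit -- x coordinate of the right edge of the image
    """
    left, top, right, bottom = start
    text = ocr((left, top, right, bottom))
    while right < limit:
        wider = min(right + step, limit)
        new_text = ocr((left, top, wider, bottom))
        if text and new_text == text:
            break  # no new data came back: treat this as meaningful white space
        right, text = wider, new_text
    return (left, top, right, bottom), text
```

If the step is smaller than the gap between two characters of the same word, the loop stops too early, which is the trade-off between chunk size and processing time mentioned above.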
You can try GetHOCRText, which returns all the recognized words along with the frame of each word in the image, as XML.
char *boxtext = _tesseract->GetHOCRText(0);
You can parse that xml to get each word and its frame.
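As an illustration of the parsing step, here is a small Python sketch that pulls the words and their boxes out of hOCR output. It assumes each word is a span with class ocrx_word and a title attribute starting with "bbox x0 y0 x1 y1", which is how hOCR describes words; the class and function names are my own.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

With the boxes in hand you can group words whose frames are close together before deciding which ones belong to the same block.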
Alternatively, if you need to, you can specify the region of the image that Tesseract should scan:
_tesseract->SetRectangle(100, 100, 200, 200);
Set this rectangle before you call recognize, so that Tesseract scans only that region and returns only the text inside it. The arguments are left, top, width and height.
There is a sample iOS application on Github that does this which might be helpful for you:
https://github.com/danauclair/CardScan
How does the author read the business card? He describes it as follows (you can also read it in the file https://github.com/danauclair/CardScan/blob/master/Classes/CardParser.m):
// A class used to parse a bunch of OCR text into the fields of an ABRecordRef which can be added to the
// iPhone address book contacts. This class copies and opens a small SQLite databse with a table of ~5500
// common American first names which it uses to help decipher which text on the business card is the name.
//
// The class tokenizes the text by splitting it up by newlines and also by a simple " . " regex pattern.
// This is because many business cards put multiple "tokens" of information on a single line separated by
// spaces and some kind of character such as |, -, /, or a dot.
//
// Once the OCR text is fully tokenized it tries to identify the name (via SQLite table), job title (uses
// a set of common job title words), email, website, phone, address (all using regex patterns). The company
// or organization name is assumed to be the first token/line of the text unless that is the name.
//
// This is obviously a far from perfect parsing scheme for business card text, but it seems to work decently
// on a number of cards that were tested. I'm sure a lot of improvements can be made here.
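To give a feel for the tokenizing and regex steps that comment describes, here is a short Python sketch. It is not the CardScan code: the patterns below are simplified assumptions of my own, and the name and job title lookups are left out.

```python
import re

# Whitespace, then one of | / - . and whitespace again, e.g. "Noah Smith | Associate".
SEPARATOR = re.compile(r"\s+[|/.\-]\s+")
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
PHONE = re.compile(r"^\+?[\d\s().-]{7,}$")
WEBSITE = re.compile(r"^(https?://)?(www\.)?[\w-]+(\.[\w-]+)+(/\S*)?$", re.IGNORECASE)


def tokenize(ocr_text):
    """Split OCR text by newlines, then by separator characters within each line."""
    tokens = []
    for line in ocr_text.splitlines():
        for part in SEPARATOR.split(line):
            part = part.strip()
            if part:
                tokens.append(part)
    return tokens


def classify(token):
    """Return a coarse field type for one token."""
    if EMAIL.match(token):
        return "email"
    if PHONE.match(token):
        return "phone"
    if WEBSITE.match(token):
        return "website"
    return "other"
```

Tokens that come back as "other" are the ones you would then check against the name table, the job title words and the address patterns.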
I'm trying to detect the elements of an electronic circuit based on binary images. To do that I have to separate the image into parts. Each part should describe one element, e.g. a resistor or a capacitor. I also want to detect branch points, where multiple lines (or multiple elements) are connected.
The following picture shows an example circuit, which contains two resistors and two branch points:
Example circuit with two resistors
That's what I want my program to detect automatically.
I have already implemented an algorithm which is able to detect line segments and branch points when the input image contains lines with a line width of 1px.
The problem is transforming an image into this 1px line model. Something like this:
Does anyone know how to do it?
Thanks in advance!
Niklas
In Matlab you can use the following code
% Read image
I = double(imread('circit.png'));
I = I(:,:,1);
% Run thinning operation
IThin = bwmorph(~I,'thin',Inf);
% Show image
imshow(IThin)
The resulting image is:
Example Image
I want to remove the lines shown in RED, as they are out of order. The lines shown in black repeat with (approximately) the same period. The period is not known beforehand. Is there any way of deleting the non-periodic lines (shown in red) automatically?
NOTE: The image is binary (black & white). The lines are shown in red only for illustration.
Of course there is a way. There is almost always some way to do something.
Unfortunately, you have not described any particular problem. The entire thing is too broad to be answered here.
To help you get started (I highly recommend you start with pen, paper and your brain):
Detect the lines -> google or think; there are many standard ways to detect lines in an image. If you don't have noise in your binary image, it's trivial.
Find any equidistant sets -> think
Delete the rest -> think (you know what is good, so everything else has to go away)
I assume, your lines are (almost) vertical.
The following should work
turn the image to a column sum histogram
try a Fourier transformation on the signal (potentially padding the image appropriately)
pick the maximum/peak from the Fourier spectrum as your base period
If you need the lines rather than the position of the lines, generate a mask with lines at appropriate intervals (as determined by your analysis before) and apply to the image.
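Here is a small pure-Python sketch of those steps, assuming a binary image stored as a list of rows (1 = line pixel) and roughly vertical lines. The function names and the ratio threshold are my own assumptions. The naive DFT is only meant to show the idea; a real implementation would use an FFT routine. Because thin lines produce strong harmonics, the sketch picks the lowest frequency whose magnitude is close to the peak rather than the peak itself.

```python
import cmath


def column_sums(image):
    """Column sum histogram: number of line pixels in each column."""
    return [sum(col) for col in zip(*image)]


def dominant_period(signal, ratio=0.5):
    """Estimate the base period (in pixels) from the Fourier spectrum of the signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]  # remove the DC component
    mags = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(centered))
        mags.append((k, abs(coeff)))
    peak = max(m for _, m in mags)
    if peak == 0:
        return None  # flat signal, no periodicity to detect
    k = next(k for k, m in mags if m >= ratio * peak)  # lowest strong frequency
    return n / k


def periodic_columns(signal, period):
    """Columns of the periodic grid, using the offset that covers the most line pixels."""
    p = round(period)  # assumes the period is close to a whole number of pixels
    offset = max(range(p), key=lambda o: sum(signal[o::p]))
    return list(range(offset, len(signal), p))
```

Columns that are not on (or near) the returned grid can then be masked out to delete the non-periodic lines.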
I have an image with a group of cells and I need to count them. I did a similar exercise using bwlabel, but this one is a bit more challenging because there are some small cells that I don't want to count. In addition, some cells are on top of each other. I've seen some MATLAB examples online, but they all involved functions that aren't available. Do you have any ideas on how to separate the overlapping cells?
Here's the image:
To make it clearer: Please help me count the number of red blood cells (which have a circular shape) like so:
The image is in grayscale but I think you can distinguish which ones are red blood cells. They have a distinctive biconcave shape... Everything else doesn't matter. But to be more specific here is an image with all the things that I want to ignore/discard/not count highlighted in red.
The main issue is the overlapping of cells.
The following is an ImageJ macro to do this (which is free software too). I would recommend you use ImageJ (or Fiji), to explore this type of stuff. Then, if you really need it, you can write an Octave program to do it.
run ("8-bit");
setAutoThreshold ("Default");
setOption ("BlackBackground", false);
run ("Convert to Mask");
run ("Fill Holes");
run ("Watershed");
run ("Analyze Particles...", "size=100-Infinity exclude clear add");
This approach gives this result:
The point-and-click equivalent is:
Image > Type > 8-bit
Image > Adjust > Threshold
select "Default" and untick "dark background" on the threshold dialogue. Then click "Apply".
Process > Binary > Fill holes
Process > Binary > Watershed
Analyze > Analyze particles...
Set "100-Infinity" as the range of valid particle sizes on the "Analyze particles" dialogue
In ImageJ, if you have a binary image, watershed actually performs the distance transform first, and then the watershed.
Octave has all the functions above except watershed (I plan on implementing it soon).
If you can't use ImageJ for your problem (why not? It can run in headless mode too), then an alternative is to get the area of each object and, if it is too high, assume it's multiple cells. It kind of depends on your question and on whether you can come up with a value for the average cell size (and its error).
Another alternative is to measure the roundness of each object identified. Cells that overlap will be less round, you can identify them that way.
It depends on how much error you are willing to accept in your program's output.
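The area and roundness ideas can be sketched in a few lines of Python, assuming you already have the area and perimeter of each labeled object. The function names are my own, and using the median area as the "typical" single-cell size is an assumption that breaks down if most objects are clumps.

```python
import math
from statistics import median


def circularity(area, perimeter):
    """4*pi*A / P^2: 1.0 for a perfect circle, lower for elongated or merged shapes."""
    return 4 * math.pi * area / perimeter ** 2


def estimate_cell_count(areas, min_area):
    """Drop objects smaller than min_area, then count large objects as several cells."""
    kept = [a for a in areas if a >= min_area]
    if not kept:
        return 0
    typical = median(kept)  # assumed size of a single cell
    return sum(max(1, round(a / typical)) for a in kept)
```

For example, with areas of 100, 105 and 210 pixels plus a 20 pixel speck and a minimum area of 50, the speck is dropped and the 210 pixel object is counted as two cells.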
This only helps with the "noise", but why not continue using bwlabel and try using bwareaopen to get rid of small objects? It seems the cells are pretty large, so just set some size threshold to get rid of small objects: http://www.mathworks.com/matlabcentral/answers/46398-removing-objects-which-have-area-greater-and-lesser-than-some-threshold-areas-and-extracting-only-th
As for overlapping cells, maybe set an upper bound for the size of a single cell. Then, when you have two cells overlapping, it will classify this as "greater than one cell" or something like that. That way it at least acknowledges the shape, even though it can't determine exactly how many cells are there.
I have looked through many tutorials, and Stack Overflow users usually throw links to PDFKitten, but when I tested it I was not satisfied with the result. For example, the search does not work with multiple words.
What I am looking for: I need to get all the words from a PDF page and highlight a word if it crosses some rectangle.
I used PDFKitten for the same.
What I did was, while scanning the PDF:
Identify the words, separated by spaces.
When a word is encountered, save that word in a model together with its current RenderingState (a model in the PDFKitten code), which will be the initial state.
When the complete word has been found (space separated), save the current RenderingState again as the final state.
The code for converting a RenderingState to the actual view's frame, using the initial and final states above, is present in PDFKitten. You can refer to that code.
Apply the current media box transform to the frame.
Finally, don't forget to convert the resulting frame into the user's coordinate system. Otherwise you will observe the reverse effect.
How can I fully justify a block of text (like MS Word does, not only on the right and not only on the left but on both sides)?
I want to justify some text (mainly Arabic text) adjusted to a certain screen size (the screen of a handheld device, actually, whose text viewer doesn't have this function) and save this text as justified, so that I can reload and reuse it elsewhere.
(The problem with MS word is, that if you copy the justified text from MS Word and paste it to another editor it'll copy it un-justified).
Update: for now I'm thinking of doing it like this:
get-a-word
get-word-width
add-word-to-total-word and add-word-width-to-total-word-width
check if total-word-width = myscreen-width then continue
else if total-word-width is between myscreen-width and (myscreen-width - 3) then
add-spaces-to-total-word until it = myscreen-width
This is what I'm thinking now, but I put this question up hoping to see whether there is a better solution, or whether somebody else has already implemented it.
PS: I hope I have made my question clear, and I'm sorry if anything is badly expressed.
edit1 : changed the title to make it more clear.
If you want to justify plain text, you can only add extra spaces to the lines to get them to align on the left and right. Unfortunately, character widths differ between fonts, so doing it this way will only work for one particular font, unless you limit yourself to monospaced fonts where all characters have the same width.
If you want a result like in Word, adding spaces won't cut it. Word will not add spaces, but stretch and shrink the existing spaces. This information is lost when you copy and paste it into another app.
Either way, justifying is an optimization problem. If you are interested in a good solution and its implementation, have a look at TeX. For an implementation that works on plain text with monospaced fonts, have a look at par.
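For the plain-text, monospaced case, a minimal greedy sketch in Python could look like this. It is neither what TeX nor what par does; it just fills each line with as many words as fit and spreads the leftover spaces over the gaps. The function names are my own, and it assumes every character occupies one cell, which does not hold for text with combining marks.

```python
def justify_line(words, width):
    """Pad the gaps between words so the line is exactly `width` characters long."""
    if len(words) == 1:
        return words[0]  # a single word cannot be stretched
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    base, extra = divmod(spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        # The leftmost `extra` gaps get one additional space each.
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


def justify(text, width):
    """Greedily break text into lines and justify all but the last one."""
    lines, current = [], []
    for word in text.split():
        # Letters so far + one space per existing word + the new word.
        if current and sum(len(w) for w in current) + len(current) + len(word) > width:
            lines.append(justify_line(current, width))
            current = []
        current.append(word)
    if current:
        lines.append(" ".join(current))  # the last line stays ragged
    return lines
```

Calling justify("the quick brown fox jumps", 11) gives the lines "the   quick", "brown   fox" and "jumps".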
There are some API calls that may help:
ExtTextOut and GetCharacterPlacement
Look at the GCP_JUSTIFY flag for GetCharacterPlacement
ExtTextOut is used by Canvas.TextRect
The problem you are going to face is always going to be differences in the rendering of the font. Word handles full justification by adjusting kerning as well as adjusting the number of pixels between words by a few (either way). The end result is lined up both margins. This pixel adjustment is done BOTH ways, and as evenly as possible.
To properly handle this in your portable device you will have to also perform the same algorithm for the display of the text there.
If this is not possible, then the ONLY way you can even get somewhat close would be to add whitespace between words.
As has been pointed out in other answers, Word does full justification by stretching the existing spaces, often by very small amounts. This is only possible if you have full control over how your text is drawn on the screen (which Word, or any other Windows program, has).
Your only real option in this regard would be to implement your own text viewer on the platform you are targeting. E.g. you would need to draw the text on the screen yourself (any platform that allows games should allow you to draw on the screen). However, this seems like an awful lot of trouble to get justified text.
Sorry couldn't be of more help.