Computer Vision: Document analysis & Text detection methods (OCR) - opencv

I'm looking for a technique to detect text in a document.
For example, for a plain text document it's easy: there are many libraries, APIs & SDKs for image processing, and they usually have methods implementing OCR algorithms.
But consider a "complex" printed document whose structure is well known & deterministic, for example the summary page of a pension program's annual report: I want to extract only the "bottom line" number. I know there is a header in the top center, a table in the middle, a paragraph at the bottom left, and at the bottom right the paragraph I'm looking for.
What is the approach to extract text from the document grouped & associated with its location on the document? The main task here is a technique for analysing the structure of the document against a pre-defined structure; once we know we are working on some specific paragraph, it's easy from there: apply a standard OCR API as mentioned above and collect the data in your custom data structure (a sketch of this is given after the template description below).
For example, for the linked document (page 1): what is the approach such that every time I apply a pure OCR API I know exactly which part of the pre-defined template I'm working on? The document template has:
Top section divided into 3 horizontal parts.
Middle section: Title and then first table, another title and then another table.
Bottom section: some text in the right corner.
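Something like the minimal sketch below is what I have in mind, assuming Python with OpenCV and pytesseract; the region fractions are placeholders for the real template geometry, not measured values.

import cv2
import pytesseract

# (x, y, w, h) as fractions of the page size, taken from the known template;
# these particular numbers are placeholders.
TEMPLATE_REGIONS = {
    "header":       (0.00, 0.00, 1.00, 0.15),
    "middle_table": (0.00, 0.15, 1.00, 0.55),
    "bottom_left":  (0.00, 0.70, 0.50, 0.30),
    "bottom_right": (0.50, 0.70, 0.50, 0.30),   # the paragraph with the "bottom line"
}

def extract_by_template(image_path):
    page = cv2.imread(image_path)
    h, w = page.shape[:2]
    results = {}
    for name, (fx, fy, fw, fh) in TEMPLATE_REGIONS.items():
        x, y = int(fx * w), int(fy * h)
        crop = page[y:y + int(fh * h), x:x + int(fw * w)]
        results[name] = pytesseract.image_to_string(crop)  # plain OCR per region
    return results

print(extract_by_template("report_page1.png")["bottom_right"])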
Thanks,

Related

Text boundary information in a page element

Some colleagues and I are working on a Google Slides add-on called Refine. In this add-on we use the Google Slides API. To improve our product we need detailed information about the position of the text within a box. Sometimes the text boundaries are outside of the actual page element (see figure 1), and other times the text is within the boundaries of the page element but overflows it (see figure 2).
Figure 1
Figure 2
What we would highly appreciate is data points on the text boundaries, see figure 3 and 4 for an example of what kind of boundary information we would like. The boundaries we would like information about are the dashed lines.
Figure 3
Figure 4
We've looked a lot into the API documentation but could not find these data points anywhere. It seems natural to us that this could be part of TextRange, but as we understand it, this is currently missing.
We would highly appreciate this feature, and I would love to provide more information if wanted. With this feature in place, we would be able to increase the value offering of our add-on a lot.

How to make a document with continuous text on odd pages only in latex?

Hey all,
For my thesis, I'm dealing with a lot of figures with experimental data in one chapter. To maintain an overview for the reader, I would like to have text on the odd pages and figures on the even pages. This way, when printing the document, all text would be printed on the right side of the opened booklet and the figures on the left.
Does anyone have an idea how to create this in LaTeX?

How to label boxed text in markdown?

My question in short is: How can you create a boxed text with a label that can be referenced?
Background: I am generating LaTeX output from a Markdown document to be included in a larger LaTeX document. I would like to describe the steps of an algorithm as boxed text with a label that can be referenced. I know how to create a labeled figure and how to create boxed text, but I haven't been able to figure out how to combine the two, i.e. how to label the boxed text as if it was a figure, or how to include the text in a figure (other than converting it to an image, which I'd like to avoid).
An initial "solution": Just putting the box and an empty figure next to each other (see below) kind of works, except that nothing ensures that the figure label won't float away from the box as I work on the document, since figures are floating objects while text boxes are part of the text, and the two are handled differently by LaTeX. Moreover, you may need to use LaTeX vertical space commands to make it look reasonably good, and it is hard to get it perfect. Is there a simple solution? Thanks!
P.S. I know that I could just switch to LaTeX and figure out a solution there, but here I am looking for a solution in Markdown, possibly making use of some embedded LaTeX commands.
You can see the algorithm in Figure \ref{methods:estimating}.
\fbox{\parbox{5in}{
1. Initialize $b_r=0$ for $r=1..R$ \\
2. For each item $i, i=1..U$, calculate ... \\
3. Re-estimate ... \\
4. Proceed to Step 2 until it converges.
}}
![Estimating ... \label{methods:estimating}]()
It is rendered like this:
You can use one of the packages for writing algorithms. See https://www.sharelatex.com/learn/algorithms.

Create aesthetically pleasing ring chart programmatically in LaTeX

I'm not sure if this is the right place to ask. Please point me to the right direction if it isn't!
I found a LaTeX resume template here:
https://github.com/opensorceror/Data-Engineer-Resume-LaTeX
Although the Skills bubbles and Interests bars are being created programmatically, the Languages ring chart looks like a simple .png image that has been inserted into the document.
Is it possible to create a replica of this chart programmatically in LaTeX? A simple code example would be appreciated!
Note: The reason I wish to create it programmatically is because employers may use automatic parsers to identify skills keywords in a resume, and such keywords would be impossible to extract from a .png image.

Scan Business Card Tesseract and Leptonica iOS

I am trying to scan a business card using Tesseract OCR. All I am doing is sending the image in with no preprocessing; here's the code I am using.
Tesseract* tesseract = [[Tesseract alloc] initWithLanguage:@"eng+ita"];
tesseract.delegate = self;
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@.-()" forKey:@"tessedit_char_whitelist"];
[tesseract setImage:[UIImage imageNamed:@"card.jpg"]]; //image to check
[tesseract recognize];
NSLog(@"Here is the text %@", [tesseract recognizedText]);
Picture of card
This is the output
As you can see the accuracy is not 100%, which is not what I am concerned about; I figure I can fix that with some simple pre-processing. However, if you notice, it mixes up the two text blocks at the bottom, which splits up the address, and possibly other information on other cards.
How can I use Leptonica (or something else, maybe OpenCV) to group the text somehow? Possibly send regions of text on the image individually to Tesseract to scan?
I've been stuck on this problem for a while; any possible solutions are welcome!
I would recommend using an algorithm called "Run Length Smoothing Algorithm" (RLSA). This algorithm is used in a lot of document image processing systems, though not every system exposes it as part of its API.
The original paper was published in 1982 and requires payment. However, the same algorithm is cited by many other papers on document image processing, where you can easily find implementation details and improvements.
One such paper is this: http://www.sciencedirect.com/science/article/pii/S0262885609002005
The basic idea is to scan the document image row by row, recording the width of the gaps between letters.
Then, nearby text characters can be combined by filtering on the width of the gaps and setting small gaps to the same color as the text. The result will be large connected components that represent:
Words, by closing the gaps between characters;
Text lines, by closing the gaps between words; and
Paragraphs, by scanning column by column and then closing the vertical gaps between text lines.
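As a rough illustration, the horizontal pass could look like the Python sketch below. It assumes a binarized NumPy array with 0 for text pixels and 255 for background, and the gap threshold is something you would tune per document, not a value from the paper.

import numpy as np

def rlsa_horizontal(binary, max_gap):
    # binary: 2D uint8 array, 0 = text (black), 255 = background (white).
    # Any run of background pixels shorter than max_gap is set to text colour,
    # which merges characters into words (or words into lines for a larger gap).
    out = binary.copy()
    for row in out:
        gap_start = None
        for x, pixel in enumerate(row):
            if pixel == 0:
                if gap_start is not None and x - gap_start <= max_gap:
                    row[gap_start:x] = 0        # close the short gap
                gap_start = None
            elif gap_start is None:
                gap_start = x                   # a new background run begins
    return out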
If you do not have access to a document image analysis library that exposes this functionality, you can mimic the effect by:
Using morphological operations (morphological closing), and then
Performing connected-component labeling on the result.
Most image processing libraries, such as OpenCV, provide such functionality. It might be less efficient to take this approach, because you will have to re-run the algorithm with different text gap sizes to achieve the different levels of clustering, unless the user provides your application with the text gap sizes.
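For example, a minimal OpenCV sketch of that morphological approach might look like the following; the kernel size and area threshold are assumptions you would tune, and the file names are placeholders.

import cv2

img = cv2.imread("card.jpg", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Close small horizontal gaps (characters -> words); a wider/taller kernel
# would merge words into lines or lines into paragraphs instead.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

# Each remaining connected component is a candidate text block.
n, labels, stats, _ = cv2.connectedComponentsWithStats(closed)
for x, y, w, h, area in stats[1:]:              # stats[0] is the background
    if area > 100:                              # drop specks; threshold is arbitrary
        cv2.rectangle(img, (x, y), (x + w, y + h), 0, 2)
cv2.imwrite("blocks.png", img)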
I think you've hit on a fundamental problem with OCR - printed designs of this type use white space as meaningful delimiters, but the OCR software doesn't/can't understand that.
This is just a wild stab in the dark, but here's what I would try:
Starting in the upper left, build a box perhaps 1-5% of the size of the whole image. Send that to OCR, and see if you get what looks meaningful back. If not, expand until you get something.
Once you have something, start expanding the block in reasonable units until you stop getting new data back. You can, hopefully, decide this point is "meaningful white space", and now you can consider this processed text as "one block" and thus complete. Now start with whatever the next unprocessed part of the image is, and thus work your way through until you've got the whole image complete.
By working with a set of interlinking expanding boxes, the hope is you'll only get meaningful blocks of data grouped together. Working with your example, once you isolate the logo and process it (and the resulting gibberish), the next box will start with, say, the "N" in Noah. Then you expand out to the right until you've gotten the whole name.
With this done you go again and, hopefully, you'll get a bounding box that includes the "A" in Associate, and get that whole line.
A pixel at a time this would take too long with all those runs to the OCR, I'm sure, but there will surely be a trade-off between "size of chunks to expand per interval" and "amount of processing required".
I don't see why this approach wouldn't work for relatively normal print designs, like a regular style business card.
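To make the idea concrete, here is a deliberately naive Python sketch using pytesseract; the step size and the "no new data" test are placeholders rather than a tested heuristic.

import pytesseract
from PIL import Image

def grow_block(page, x, y, step=40):
    # Grow a crop to the right and down until OCR stops returning new text.
    w = h = step
    last_text = ""
    while w < page.width or h < page.height:
        box = (x, y, min(x + w, page.width), min(y + h, page.height))
        text = pytesseract.image_to_string(page.crop(box)).strip()
        if text and text == last_text:
            break                               # the box grew but nothing new came back
        last_text = text
        w += step
        h += step
    return (x, y, w, h), last_text

page = Image.open("card.jpg")
print(grow_block(page, 0, 0))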
You can try the hOCR output (GetHOCRText), which returns all the scanned words, along with the frame of each word in that image, as XML.
char *boxtext = _tesseract->GetHOCRText(0);
You can parse that xml to get each word and its frame.
Alternatively, if you need to, you can specify the frame of the image that Tesseract should scan.
_tesseract->SetRectangle(100, 100, 200, 200);
Set this frame before you call recognize, so Tesseract will scan only that frame and return the text found within it.
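For reference, the same two ideas in Python with pytesseract (my assumption; your code uses the Objective-C/C++ API): image_to_data gives a bounding box per word, and cropping the image first has the same effect as SetRectangle.

import pytesseract
from PIL import Image
from pytesseract import Output

page = Image.open("card.jpg")

# Per-word bounding boxes, comparable to parsing the hOCR output.
data = pytesseract.image_to_data(page, output_type=Output.DICT)
for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                            data["width"], data["height"]):
    if text.strip():
        print(repr(text), (x, y, w, h))

# Restricting recognition to a region, comparable to SetRectangle(100, 100, 200, 200)
# (left, top, width, height): just crop the image before recognizing.
region = page.crop((100, 100, 300, 300))        # left, top, right, bottom
print(pytesseract.image_to_string(region))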
There is a sample iOS application on Github that does this which might be helpful for you:
https://github.com/danauclair/CardScan
How does he read the business card? He writes the following (or you can read it in the file: https://github.com/danauclair/CardScan/blob/master/Classes/CardParser.m ):
// A class used to parse a bunch of OCR text into the fields of an ABRecordRef which can be added to the
// iPhone address book contacts. This class copies and opens a small SQLite database with a table of ~5500
// common American first names which it uses to help decipher which text on the business card is the name.
//
// The class tokenizes the text by splitting it up by newlines and also by a simple " . " regex pattern.
// This is because many business cards put multiple "tokens" of information on a single line separated by
// spaces and some kind of character such as |, -, /, or a dot.
//
// Once the OCR text is fully tokenized it tries to identify the name (via SQLite table), job title (uses
// a set of common job title words), email, website, phone, address (all using regex patterns). The company
// or organization name is assumed to be the first token/line of the text unless that is the name.
//
// This is obviously a far from perfect parsing scheme for business card text, but it seems to work decently
// on a number of cards that were tested. I'm sure a lot of improvements can be made here.
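For what it's worth, here is a much-reduced Python sketch of that kind of tokenize-then-match scheme; the regexes and field names are my own assumptions, not the CardScan code.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
URL   = re.compile(r"(?:https?://|www\.)\S+", re.I)

def parse_card(ocr_text):
    # Tokenize by newlines and by separators such as |, -, / or a dot between spaces.
    tokens = []
    for line in ocr_text.splitlines():
        tokens += [t.strip() for t in re.split(r"\s[|/.-]\s", line) if t.strip()]
    fields = {"company": tokens[0] if tokens else None}
    for token in tokens:
        if EMAIL.search(token):
            fields.setdefault("email", EMAIL.search(token).group())
        elif URL.search(token):
            fields.setdefault("website", URL.search(token).group())
        elif PHONE.search(token):
            fields.setdefault("phone", PHONE.search(token).group())
    return fields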
