Semi-Automatic Text Highlighting in Images? - image-processing

Greetings Overflowers,
Given that:
- I have images of documents with text in mixed languages
- I need this text to be highlightable (word by word) by end users
- I already have this text in plain digital format
- I will help my program figure out where the words are
- I do not want that help to be tedious for me
- I will also manually fix small inaccuracies after my program runs
What is the best, easiest help I can provide so my program can draw rectangles around selected words? What algorithm would you use for this program? I tried OCR tools like OmniPage Pro, but they do not provide this functionality.
Regards

I implemented word bounding boxes and word highlighting in an application some years ago. You said "I have this text in plain digital format". The key component is having coordinates for characters or words so they can be mapped to the proper image areas. It works like a searchable PDF: when you select text, the selection is internally mapped to the image layer, and conversely a selection on the image selects the matching text. But even from a PDF, I believe those coordinates cannot be exported. If no such coordinate information exists in your text yet, the easiest route is probably to re-OCR the images with a high-quality engine that can produce coordinates as part of its output. If you were to use WiseTREND OCR Cloud 2.0, its XML output produces all that detailed metadata. Once the coordinate information exists, all the major components are there, and the rest is work on an efficient UI design.
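If you prefer an open-source route for the re-OCR step, Tesseract can also emit word coordinates. A minimal sketch using pytesseract (the same binding used in the question below; 'page.png' and the output name are placeholders):

import pytesseract
from PIL import Image, ImageDraw

# image_to_data returns parallel lists with one entry per detected word
image = Image.open('page.png')
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

draw = ImageDraw.Draw(image)
for i, word in enumerate(data['text']):
    if word.strip():  # skip empty detections
        x, y, w, h = (data['left'][i], data['top'][i],
                      data['width'][i], data['height'][i])
        draw.rectangle([x, y, x + w, y + h], outline='red')

image.save('highlighted.png')

The (word, box) pairs can then be matched against your known digital text, so your manual pass only has to nudge the rectangles.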

Related

Best practices for tesseract ocr using python

I'm working on a project in which I want to recognize text from a credit-card-sized document. The document contains details like name, phone number, address, etc. I'm capturing the image and passing it to the Tesseract engine using
text = pytesseract.image_to_string(Image.open(filename), lang='eng'). Sometimes I get decent results for each field, but most of the time the result is very bad. How do I resolve this issue? What are the best practices? How do document readers work with OCR? Is it possible to do region-based OCR on the document?
A single approach can't read every kind of text; you have to apply different approaches to different types of document.
If the text is not horizontal, you have to rotate it first. If the text is skewed or curved, you have to apply a transformation (e.g., a Hough transform to estimate the skew angle).
In short, for the package to read text reliably, the text should be clear and horizontal; otherwise you need to create rules and transform the image first.
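Region-based OCR is possible: for a fixed layout like a card, crop each field to its own image and OCR the crops separately with a page-segmentation mode suited to a single line. A minimal sketch, assuming hypothetical field coordinates that would come from your card template:

import pytesseract
from PIL import Image

image = Image.open('card.png')
# Hypothetical field boxes measured from the card template:
fields = {
    'name':  (40, 30, 400, 70),    # (left, top, right, bottom)
    'phone': (40, 80, 400, 120),
}

for field, box in fields.items():
    crop = image.crop(box)
    # --psm 7 tells Tesseract to treat the crop as a single text line
    text = pytesseract.image_to_string(crop, lang='eng', config='--psm 7')
    print(field, ':', text.strip())

Cleaning each crop first (grayscale, threshold, upscale toward 300 DPI) usually helps more than any engine flag.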

Finding known text in an image (guided OCR)

I'm looking for a way to locate known text within an image.
Specifically, I'm trying to create a tool convert a set of scanned pages into PDFs that support searching and copy+paste. I understand how this is usually done: OCR the page, retaining the position of the text, and then add the text as an invisible layer to the PDF. Acrobat has this functionality built in, and tesseract can output hOCR files (containing the recognized text along with its location), which can be used by hocr2pdf to generate a text layer.
Unfortunately, my source images are rather low quality (at most 150 DPI, with plenty of JPEG artifacts, and non-solid backgrounds behind some of the text), leading to pretty poor OCR results. However, I do have a copy of the text (sans pictures and layout) that appears on each page.
Matching already-known text to its location on a scanned page seems like it would be much easier to do accurately, but I haven't found any software with this capability built in. How can I leverage existing software to do this?
Edit: The text varies in size and font, though passages of it are consistent.
The thought that springs to mind for me would be a cross-correlation. So, I would take the list of words that you know occur on the page and render them one at a time onto a canvas to create a picture of each word. You would need to use a similar font and size to the words in the document - which is what I asked about in my comment. Then I would run a normalised cross-correlation of the picture of the word against the scanned image to see where it occurs. I would do all that with ImageMagick, which is available for Windows and OS X (use homebrew) and included in most Linux distros.
So, let's take a screengrab of the second paragraph of your question and look for the word pretty - where you mention pretty poor OCR.
First, you need to render the word pretty onto a white background. The command will be something like this:
convert -background white -fill black -font Times -pointsize 14 label:pretty word.png
Result:
Then perform a normalised cross-correlation using Fred Weinhaus's script from here like this:
normcrosscorr -p word.png scan.png correlation-result.png
Match Coords: (504,30) And Score In Range 0 to 1: (0.999803)
and you can see the coordinates of the match are 504,30.
Result:
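If you would rather stay in one script than shell out to ImageMagick, OpenCV's matchTemplate implements the same normalised cross-correlation. A rough sketch (file names are placeholders, and the rendered word must be at roughly the same scale as the scan):

import cv2

scan = cv2.imread('scan.png', cv2.IMREAD_GRAYSCALE)
word = cv2.imread('word.png', cv2.IMREAD_GRAYSCALE)

# TM_CCOEFF_NORMED scores every position; 1.0 is a perfect match
result = cv2.matchTemplate(scan, word, cv2.TM_CCOEFF_NORMED)
_, score, _, top_left = cv2.minMaxLoc(result)

h, w = word.shape
print('match at', top_left, 'score', score)
cv2.rectangle(scan, top_left, (top_left[0] + w, top_left[1] + h), 0, 2)
cv2.imwrite('correlation-result.png', scan)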
Another Idea
Another idea might be to take Google's Tesseract-OCR and replace the standard dictionary with the text file containing the words on the page you are processing...
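I have not tried this end-to-end, but Tesseract does accept a custom word list via its --user-words option (honoured only by the legacy engine in recent versions, hence --oem 0 below), which might be enough to bias recognition toward the known page text:

import pytesseract
from PIL import Image

# 'page_words.txt' (one known word per line) is a placeholder;
# --user-words requires the legacy engine's language data.
config = '--oem 0 --user-words page_words.txt'
text = pytesseract.image_to_string(Image.open('scan.png'), config=config)
print(text)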

Computer Vision - Use Image Matching or OCR to recognize page of a text only book?

I want to be able to recognize which page of a text-only (no images) book I'm on. What is the best approach?
I was initially thinking of some sort of image matching, but the pages of an all-text book look so similar that I'm not sure how well this would work.
My second thought was to use OCR?
Any ideas or suggestions... thanks!
I think image matching is really useless in your case...
If you want to detect which page you are on, and the book has numbered pages, you can use an OCR engine like Tesseract:
1) Locate the page number (top-left corner, right, bottom...)
2) Extract it (crop that region of the image so you can decode it)
(2bis) Preprocess the cropped image...)
3) Decode it (use Tesseract or another OCR) - see the sketch below
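A minimal sketch of steps 1-3, assuming the page number sits in the top-left corner (the crop box is hypothetical; adjust it to the book's layout):

import pytesseract
from PIL import Image

page = Image.open('page_photo.png')
w, h = page.size
corner = page.crop((0, 0, w // 5, h // 10))   # hypothetical top-left region

# --psm 7: single text line; the whitelist restricts output to digits
config = '--psm 7 -c tessedit_char_whitelist=0123456789'
number = pytesseract.image_to_string(corner, config=config).strip()
print('page number:', number)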
If you don't want to use an OCR, you can look at Hu moments; if the numbers are standard printed numbers, they can be quite good at recognising them.

Converting labeled images (EPS) to an interactive web page using ImageMagick, OCR, JavaScript

Business Insight:
We are in the education domain, and we have a requirement to automate the conversion of labeled images (EPS) into interactive exercises (using HTML/SVG/JavaScript) used by students.
Technical Insight:
Layered EPS files are what we get from the publishers. Each EPS file should be converted into two PNG files: [1.png] which has the label texts only, and [2.png] everything else but the label texts.
Then [1.png] should be run through some advanced OCR (?) program that outputs the label texts along with their positions (X,Y coordinates) in the image. HTML/JavaScript can then be used to overlay the label texts on [2.png], along with interactions like drag-and-drop.
Tried so far:
Manually converted the EPS into PNG and used ImageMagick and Tesseract OCR to get the label text alone.
Question:
How far can the above image-processing requirements (EPS->PNG + text labels with coords) be automated, and what are the best tools for the job? I appreciate the help in advance.
PS: I'm a UI developer and can handle the HTML/JavaScript part if the coordinates are provided for the labels.
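For what it's worth, the PNG half of the pipeline can be automated along these lines. A sketch assuming ImageMagick's convert (with Ghostscript) for rasterisation and pytesseract for the coordinates; the layer separation into [1.png]/[2.png] is assumed to happen upstream:

import json
import subprocess
import pytesseract
from PIL import Image

# Rasterise the labels-only EPS layer at 300 DPI
subprocess.run(['convert', '-density', '300', 'labels.eps', '1.png'],
               check=True)

# Extract each label with its bounding box
data = pytesseract.image_to_data(Image.open('1.png'),
                                 output_type=pytesseract.Output.DICT)
labels = [
    {'text': data['text'][i],
     'x': data['left'][i], 'y': data['top'][i],
     'w': data['width'][i], 'h': data['height'][i]}
    for i in range(len(data['text'])) if data['text'][i].strip()
]

# Hand the coordinates to the HTML/JavaScript layer as JSON
print(json.dumps(labels, indent=2))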

Generate font from an image of text

Is it possible to generate a specific set of fonts from the image given below?
My idea is to generate a specific font for the given image of text by manually selecting portions of the image and mapping them to a set of letters, generating the font from those mappings, and then using that font to make the image readable by an OCR. Is this kind of font generation possible with any open-source implementation? Also, please suggest any good OCRs.
Abbyy FineReader 10 gets better-than-expected results but predictably gets confused when the characters touch.
Your problem is that the line spacing is too small. The descenders of each line overlap the character bounding boxes of the characters in the line directly below. This makes character segmentation almost impossible because the characters are touching and overlapping. The number of combinations of overlapping characters is virtually impossible to train for. The 'g' and 'y' characters are the worst offenders.
A double line spaced version of this would probably OCR reasonably well.
A custom solution that segments and separates each line, along with a good dictionary, would definitely improve the results. There would still be some errors to correct manually, though. The custom routine would have to deal with the ascenders and descenders and try to segment the image into lines, which can then be fed to a decent OCR engine. One way would be to analyse every character blob on the page and allocate it to a line. Leptonica (www.leptonica.com - C imaging library) would probably make this job a little easier.
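A rough Python/OpenCV equivalent of that blob-to-line allocation might look like this (a sketch only: the fixed 15 px gap is hypothetical, and touching ascenders/descenders need smarter clustering):

import cv2

img = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# One connected component per character blob (ideally)
n, _, stats, centroids = cv2.connectedComponentsWithStats(binary)

# Sort blobs by vertical centre and break into lines at large gaps
blobs = sorted(range(1, n), key=lambda i: centroids[i][1])
lines, current, last_y = [], [], None
for i in blobs:
    y = centroids[i][1]
    if last_y is not None and y - last_y > 15:  # 15 px gap: hypothetical
        lines.append(current)
        current = []
    current.append(i)
    last_y = y
if current:
    lines.append(current)

# Each line's vertical extent; crop img[top:bottom, :] and feed to OCR
for line in lines:
    top = min(stats[i, cv2.CC_STAT_TOP] for i in line)
    bottom = max(stats[i, cv2.CC_STAT_TOP] + stats[i, cv2.CC_STAT_HEIGHT]
                 for i in line)
    print('line from y=%d to y=%d' % (top, bottom))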
I would not try this without increasing the resolution to 200 or 300 dpi first.
With this custom solution, training a font becomes an option if the OCR engine does a poor job initially.
Abbyy (www.abbyy.com) or Google Tesseract OCR 3.00 would be a good place to start.
No guarantees as to whether any of this will work, though. This is quite a difficult page to OCR, and you need to work out whether it would be better to have it typed up manually overseas. It depends on the number of pages you need to process.
