Finding known text in an image (guided OCR) - image-processing

I'm looking for a way to locate known text within an image.
Specifically, I'm trying to create a tool to convert a set of scanned pages into PDFs that support searching and copy+paste. I understand how this is usually done: OCR the page, retaining the position of the text, and then add the text as an invisible layer to the PDF. Acrobat has this functionality built in, and tesseract can output hOCR files (containing the recognized text along with its location), which can be used by hocr2pdf to generate a text layer.
Unfortunately, my source images are rather low quality (at most 150 DPI, with plenty of JPEG artifacts, and non-solid backgrounds behind some of the text), leading to pretty poor OCR results. However, I do have a copy of the text (sans pictures and layout) that appears on each page.
Matching already known text to its location on a scanned page seems like it would be much easier to do accurately, but I have not found any software with this capability built in. How can I leverage existing software to do this?
Edit: The text varies in size and font, though passages of it are consistent.

The thought that springs to mind for me would be a cross-correlation. I would take the list of words that you know occur on the page and render them one at a time onto a canvas to create a picture of each word. You would need to use a similar font and size to the words in the document, which is what I asked about in my comment. Then I would run a normalised cross-correlation of the picture of the word against the scanned image to see where it occurs. I would do all that with ImageMagick, which is available for Windows and OS X (use Homebrew on OS X) and is included in most Linux distros.
So, let's take a screengrab of the second paragraph of your question and look for the word pretty - where you mention pretty poor OCR.
First, you need to render the word pretty onto a white background. The command will be something like this:
convert -background white -fill black -font Times -pointsize 14 label:pretty word.png
Result:
Then perform a normalised cross-correlation using Fred Weinhaus's normcrosscorr script, like this:
normcrosscorr -p word.png scan.png correlation-result.png
Match Coords: (504,30) And Score In Range 0 to 1: (0.999803)
and you can see the coordinates of the match are 504,30.
Result:
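For readers who want to see what the normalised cross-correlation step actually computes, here is a minimal pure-Python sketch. It is not Fred Weinhaus's script: the function name and the toy 8x6 "scan" are made up for illustration, it uses the zero-mean variant (scores from -1 to 1 rather than 0 to 1), and it is far too slow for a real page scan.

```python
import math

def normalized_cross_correlation(image, template):
    """Slide template over image; return (best_score, (x, y)) of the best match.

    image and template are lists of rows of grayscale values.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(d * d for d in t_dev))
    best_score, best_xy = -1.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            # Window of the image under the template at offset (x, y).
            w_vals = [image[y + j][x + i] for j in range(th) for i in range(tw)]
            w_mean = sum(w_vals) / len(w_vals)
            w_dev = [v - w_mean for v in w_vals]
            w_norm = math.sqrt(sum(d * d for d in w_dev))
            if t_norm == 0 or w_norm == 0:
                continue  # flat patch: the correlation is undefined, skip it
            score = sum(a * b for a, b in zip(t_dev, w_dev)) / (t_norm * w_norm)
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_score, best_xy

# Toy data (hypothetical): a 3x3 "word" pasted into a white 8x6 "scan" at (3, 2).
pattern = [[0, 255, 0], [255, 0, 255], [0, 0, 255]]
scan = [[255] * 8 for _ in range(6)]
for j, row in enumerate(pattern):
    for i, value in enumerate(row):
        scan[2 + j][3 + i] = value

score, (x, y) = normalized_cross_correlation(scan, pattern)
print(f"Match Coords: ({x},{y}) Score: {score:.6f}")
```

Subtracting the mean and dividing by the norms is what makes the score insensitive to overall brightness and contrast, which matters for scans with uneven backgrounds.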
Another Idea
Another idea might be to take Google's Tesseract-OCR and replace the standard dictionary with the text file containing the words on the page you are processing...
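A related, simpler variation that needs no changes to Tesseract itself is to post-correct whatever the OCR engine returns against the known word list. The sketch below does that with Python's difflib. Note that this is a swapped-in technique, not the dictionary replacement described above, and the function name, the cutoff value and the sample tokens are all assumptions.

```python
import difflib

def snap_to_known_words(ocr_words, known_words, cutoff=0.6):
    """Replace each OCR token with its closest known word, if one is close enough."""
    vocabulary = sorted(set(known_words))
    corrected = []
    for token in ocr_words:
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        # Keep the original token when nothing in the vocabulary is similar enough.
        corrected.append(matches[0] if matches else token)
    return corrected

# Hypothetical example: the known page text and a noisy OCR reading of it.
known = "leading to pretty poor OCR results".split()
noisy = ["1eading", "to", "prctty", "p0or", "OCR", "resu1ts"]
print(snap_to_known_words(noisy, known))
```

Because the positions reported by the OCR engine are untouched, the corrected words can still be written into the invisible text layer at their original coordinates.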

Related

Training Tesseract on specific fonts results in empty tr files

I'm working on a college project that involves OCRing a certain digit code (with a few other characters as separators, mainly '.', '/' etc.).
That digit code (printed on products, for example) is usually in "digital" fonts (e.g. a 7-segment-like font, or a pixelated font etc.).
So I am trying to train Tesseract on several digital fonts I've found online, similar to those used with these codes.
The thing is, that Tesseract recognizes the tiff files I provide it as blank pages.
Things I've tried:
1. Creating a .box file using JTesseract & qt-box (and adjusting the boxes manually): in this case, the box and tiff files are read by Tesseract and I get the output "1 Page", but no characters are recognized and the tr file is blank.
2. Creating a .box file with Tesseract's makebox: in this case no boxes are created at all.
PS - I manage to train it just fine using more traditional fonts (Arial for example)
Any ideas?
I'm attaching an image of such an example font.
Thank you!
I managed to work around most of the issues. Posting it in case it could help anyone else:
I took two steps to get Tesseract to identify my text:
1. Image processing on the training images: I applied some image processing methods (mainly dilate, erode and some blur) to "connect" the pixels in the text that were segmented or separated from one another. It's VERY IMPORTANT to apply exactly the same steps to the images that will be fed to the OCR.
2. I noticed that simply saving images as TIFF/PNG via code doesn't save the DPI setting in the header for some reason (and Tesseract identified them as 0 DPI). I assume there's a way to do that in code but I didn't have time, so I just opened the files in Photoshop and saved them from there.
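For the DPI issue, one possible way to do it in code for PNG files is to write the resolution into the file's pHYs chunk. The sketch below uses only the Python standard library; the function names and the 1x1 test image are made up for illustration, and it assumes the program reading the file honours pHYs. (Imaging libraries usually expose this more directly; Pillow, for instance, accepts a dpi argument when saving.)

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def _chunk(chunk_type, data):
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def set_png_dpi(png_bytes, dpi):
    """Return a copy of png_bytes whose pHYs chunk declares the given DPI."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pixels_per_metre = round(dpi / 0.0254)
    # pHYs payload: x density, y density, unit specifier (1 = metre).
    phys = _chunk(b"pHYs", struct.pack(">IIB", pixels_per_metre, pixels_per_metre, 1))
    out = [PNG_SIGNATURE]
    pos = len(PNG_SIGNATURE)
    while pos < len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        end = pos + 12 + length  # 4 length + 4 type + data + 4 CRC
        if chunk_type != b"pHYs":  # drop any existing resolution chunk
            out.append(png_bytes[pos:end])
        if chunk_type == b"IHDR":  # pHYs must precede IDAT, so put it right after IHDR
            out.append(phys)
        pos = end
    return b"".join(out)

# Hypothetical 1x1 grayscale PNG, built by hand so the example is self-contained.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
tiny_png = (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(b"\x00\xff")) + _chunk(b"IEND", b""))
tagged = set_png_dpi(tiny_png, 300)
```

PNG stores resolution in pixels per metre, so 300 DPI becomes 11811. TIFF files store resolution in different header tags and would need a different approach.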
I'm not entirely sure whether it was step 1, step 2, or both that solved my issue, but most characters were eventually identified.
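The "connect the pixels" preprocessing (dilate, then erode) is known as morphological closing. Here is a minimal pure-Python sketch of it on a binary image, assuming a 3x3 square neighbourhood; the function names and the toy stroke are illustrative, and a real pipeline would more likely use an image library.

```python
def _morph(grid, combine):
    """Apply a 3x3 neighbourhood rule to a binary grid (lists of 0/1 rows)."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Pixels outside the image count as background (0).
            window = [
                grid[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            ]
            out[y][x] = 1 if combine(window) else 0
    return out

def dilate(grid):
    """A pixel becomes ink if any pixel in its 3x3 neighbourhood is ink."""
    return _morph(grid, any)

def erode(grid):
    """A pixel stays ink only if its whole 3x3 neighbourhood is ink."""
    return _morph(grid, all)

def close_gaps(grid):
    """Morphological closing: dilate, then erode."""
    return erode(dilate(grid))

# Hypothetical segmented stroke: one row of ink with a one-pixel gap.
stroke = [[0] * 9 for _ in range(5)]
stroke[2] = [0, 0, 1, 1, 0, 1, 1, 0, 0]
closed = close_gaps(stroke)
print(closed[2])
```

Because out-of-image pixels are treated as background, ink touching the image edge would be eroded away; padding the image first avoids that.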

MySQL WorkBench EER diagram dimensions are terrible

I am using MySQL Workbench to generate an EER diagram and, to the best of my knowledge, one cannot control the dimensions of the canvas, only its size in number of pages. The result is a huge amount of white space around the diagram, making it nearly unusable. Why anyone would design it this way is beyond me. There are a lot of questions that ask how to crop a PDF, but they are either more complicated (i.e. crop to a certain dimension, or crop and output to a different format and ratio), or they do not preserve the image quality, or they just plain do not work. My question therefore is this:
How does one create or convert an EER diagram using MySQL Workbench such that there are no white borders AND the image quality is preserved?
Note I asked the question here as it pertains to databasing, but apologies if it is in the wrong place.
Looks like what you are after is a way to limit the output of an image export to a relatively small area, so that it fits nicely in another document. Several options are possible:
1) Export as png and simply cut off the unwanted parts. Depending on the further usage this might be good enough.
2) Export as SVG and use any of the SVG editors to limit the image size to the wanted area only. Then convert it to the format you need in your target document.
3) Set a paper size in the model that encompasses the content as closely as possible. E.g. the statement paper type is quite small. Then rearrange your objects. Resize them if you need larger ones. By setting a larger font (via Preferences) you should be able to make the entire appearance larger. Then export as PDF.
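Option 1 can be automated. ImageMagick's -trim operator does this on real files; the pure-Python sketch below only illustrates the underlying idea of finding the bounding box of everything that is not background. The function names, the tolerance and the toy grayscale "page" are assumptions.

```python
def content_bbox(pixels, background=255, tolerance=10):
    """Return (left, top, right, bottom) around the non-background pixels.

    right and bottom are exclusive; returns None for a blank image.
    """
    rows = [y for y, row in enumerate(pixels)
            if any(abs(v - background) > tolerance for v in row)]
    cols = [x for x in range(len(pixels[0]))
            if any(abs(row[x] - background) > tolerance for row in pixels)]
    if not rows:
        return None
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1

def crop_to_content(pixels, margin=0):
    """Cut away the background border, keeping an optional margin."""
    box = content_bbox(pixels)
    if box is None:
        return pixels
    left, top, right, bottom = box
    left, top = max(left - margin, 0), max(top - margin, 0)
    right = min(right + margin, len(pixels[0]))
    bottom = min(bottom + margin, len(pixels))
    return [row[left:right] for row in pixels[top:bottom]]

# Hypothetical exported page: a white 8x6 canvas with a small dark "diagram".
page = [[255] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4, 5):
        page[y][x] = 0

print(content_bbox(page))
print(crop_to_content(page))
```

Cropping a raster export this way does not resample anything, so the pixels that remain are unchanged.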

ImageMagick - Transparent background - Act like Photoshop's "Magic wand"

I'm trying to convert hundreds of images that
Have an unknown subject centered in the image
Have a white background
I've used ImageMagick's convert utility in the following way
convert ORIGINAL.jpg -fuzz 2% -matte -transparent "#FFFFFF" TRANSPARENT.png
The problem is, some of my subjects are within the "white" scale, so, just like the weatherman wearing a green tie, some of my subjects seem to be disintegrating.
Is there any way to solve this via ImageMagick? Are there any alternative solutions? Scripting GIMP?
As you said, GIMP has a magic wand tool that can be used to select contiguous areas of the same color, and so it can avoid the "green tie syndrome". It can run into trouble if something like a human hair crosses the image (which will separate some of the white areas from each other). Another common problem, especially with pictures of people, is when they hold a hand next to the body and there is a small enclosed hole between the hand and the body.
Basically, it is not too hard to create a GIMP script that opens many images in batch, uses the magic wand to select the pixel at some corner (or, if desired, at several known fixed places, not just one) and then removes the selection.
If it's hard to find a white area at a fixed spot, it is possible to search inward, meaning that the script looks for a white pixel on the borders and then moves gradually inward in a spiral until it finds one. But this is very inefficient in the basic scripting engine, so I hope you don't need this.
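To make the magic-wand idea concrete, here is a rough pure-Python sketch of a border-seeded flood fill. It is not a GIMP script: the function name, the tolerance and the toy grayscale image are assumptions. It seeds from every near-white border pixel at once, which sidesteps having to pick a fixed corner or do a spiral search.

```python
from collections import deque

def background_mask(pixels, tolerance=5, white=255):
    """Mark near-white pixels that are connected to the image border.

    pixels is a list of rows of grayscale values; 4-connectivity is used.
    """
    h, w = len(pixels), len(pixels[0])

    def near_white(y, x):
        return white - pixels[y][x] <= tolerance

    mask = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the fill with every near-white pixel on the border.
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and near_white(y, x):
                mask[y][x] = True
                queue.append((y, x))
    # Breadth-first flood fill through connected near-white pixels.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny][nx] and near_white(ny, nx):
                mask[ny][nx] = True
                queue.append((ny, nx))
    return mask

# Hypothetical subject: a dark ring enclosing one white pixel that belongs to it.
image = [
    [255, 255, 255, 255, 255],
    [255,  40,  40,  40, 255],
    [255,  40, 255,  40, 255],
    [255,  40,  40,  40, 255],
    [255, 255, 255, 255, 255],
]
mask = background_mask(image)
print(sum(cell for row in mask for cell in row), "background pixels")
```

The white pixel inside the ring is left alone because it is not connected to the border. The same property is why an enclosed hole between a hand and the body would also be left white.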
If any of the options suggested above is OK, tell me and I'll create a GIMP script for it. It would be even better if you could post some sample images, but I'll try to help even without them.

Semi-Automatic Text Highlighting in Images?

Greetings Overflowers,
Given that:
I have images of documents with text of mixed languages
I need this text to be highlightable (word by word) by end users
I have this text in plain digital format already
I will help my program to figure out where words are
I do not want my help to be tedious to me
I will also manually fix small inaccuracies after my program
What is the best easy help I can provide for my program to be able to draw rectangles around selected words? What algorithm would you use for this program? I tried OCR stuff like OmniPage Pro but they do not provide this functionality.
Regards
I implemented word bounding boxes and word highlighting in my application some years ago. You said "I have this text in plain digital format". One key component is having the coordinates of characters or words, so that they can be mapped to the proper image areas. It works like a searchable PDF: when you select text, it is internally mapped to the image layer, and conversely, selecting on the image selects the matching text. But even from a PDF I believe those coordinates cannot be exported. If no such coordinate information currently exists for your text, the easiest route is probably to re-OCR the images with a high-quality engine that can produce coordinates as part of its output. If you were to use WiseTREND OCR Cloud 2.0, its XML output will provide all that detailed metadata. If the coordinate information exists, then all the major components are there and what remains is just efficient UI design work.
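Once an OCR pass has produced words with coordinates, the already-known plain text can borrow those coordinates. The sketch below aligns the two word sequences with Python's difflib and copies boxes across where the alignment is unambiguous. The function name, the box format (x0, y0, x1, y1) and the sample data are hypothetical, and words the aligner cannot pair up are left without a box for manual fixing.

```python
import difflib

def transfer_boxes(known_words, ocr_words):
    """Attach OCR bounding boxes to the known words.

    ocr_words is a list of (text, box) pairs. Returns a list of
    (known_word, box_or_None) in the order of known_words.
    """
    ocr_texts = [text for text, _ in ocr_words]
    result = [(word, None) for word in known_words]
    matcher = difflib.SequenceMatcher(a=known_words, b=ocr_texts, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        # "equal": identical words. A "replace" of equal length is assumed
        # to be a run of one-to-one misreads, so boxes are copied in order.
        if tag == "equal" or (tag == "replace" and a1 - a0 == b1 - b0):
            for k in range(a1 - a0):
                result[a0 + k] = (known_words[a0 + k], ocr_words[b0 + k][1])
    return result

# Hypothetical data: the known text and a noisy OCR pass with word boxes.
known = ["leading", "to", "pretty", "poor", "OCR", "results"]
ocr = [
    ("leading", (10, 5, 70, 20)), ("to", (75, 5, 90, 20)),
    ("prctty", (95, 5, 140, 20)), ("poor", (145, 5, 180, 20)),
    ("OCR", (185, 5, 215, 20)), ("resu1ts", (220, 5, 275, 20)),
]
for word, box in transfer_boxes(known, ocr):
    print(word, box)
```

With boxes attached to the correct words, drawing a highlight rectangle for a selected word is a lookup rather than a recognition problem.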

Converting labeled images (EPS) to interactive web page using ImageMagick, OCR, JavaScript

Business Insight:
We are in the education domain and need to automate the conversion of labeled images (EPS) into interactive exercises (using HTML/SVG/JavaScript) for students.
Technical Insight:
Layered EPS files are what we get from the publishers. Each EPS file should be converted into two PNG files: [1.png], which has the label texts only, and [2.png], which has everything else but the label texts.
Then [1.png] should be run through some advanced OCR (?) program that should output the label texts along with their positions (X,Y coords) in the image. Then HTML/JavaScript could be used to overlay the label texts over the [2.png] along with some interactions like Drag'n'drop using JavaScript.
Tried so far:
Manually converted the EPS into PNG and used ImageMagick and Tesseract OCR to get the label text alone.
Question:
To what extent can the above image processing requirements (EPS->PNG+text labels with coords) be automated, and what are the best tools for it? Thanks in advance for the help.
PS: I'm a UI developer and can handle the HTML/JavaScript part, if the coords for the labels are provided.
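One possible starting point for the coordinate step: Tesseract can emit hOCR, an HTML-based format in which each recognized word carries a bounding box in its title attribute. The sketch below pulls (text, box) pairs out of such markup using only the Python standard library. The class name and the sample fragment are made up for illustration, and it assumes words are marked with the ocrx_word class and a "bbox x0 y0 x1 y1" title.

```python
import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []
        self._depth = 0     # nesting depth of tags inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> nested inside a word span
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None

# Hypothetical hOCR fragment for two labels.
sample = """<span class='ocr_line' title='bbox 36 92 580 122'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Left</span>
<span class='ocrx_word' id='word_1_2' title='bbox 109 92 190 116; x_wconf 91'><strong>ventricle</strong></span>
</span>"""

parser = HocrWords()
parser.feed(sample)
print(parser.words)
```

The resulting list could be serialized to JSON and handed to the HTML/JavaScript side, which would position each label over [2.png] using the box coordinates.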
