Business Insight:
We work in the education domain and need to automate the conversion of labeled images (EPS) into interactive exercises
(using HTML/SVG/JavaScript) used by students.
Technical Insight:
We get layered EPS files from the publishers. Each EPS file should be converted into two PNG files: [1.png], which has the label texts only, and [2.png], which has everything else but the label texts.
Then [1.png] should be run through some advanced OCR (?) program that outputs the label texts along with their positions (X,Y coords) in the image. HTML/JavaScript could then be used to overlay the label texts on [2.png], along with some interactions like drag'n'drop.
Tried so far:
I manually converted the EPS into PNG and used ImageMagick and Tesseract OCR to get the label text alone.
Question:
How far can the above image-processing requirements (EPS -> PNG + text labels with coords) be automated, and what are the best tools to use? I appreciate the help in advance.
PS: I'm a UI developer and can handle the HTML/JavaScript part if just the coords are provided for the labels.
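For the "label texts with coords" step, a minimal sketch, assuming pytesseract and Pillow are installed and that [1.png] already contains only the labels. pytesseract.image_to_data returns word-level boxes; the helper names group_words_into_labels and ocr_labels are made up for this illustration, and grouping words by Tesseract's line numbering is an assumption about how labels should be merged.

```python
from collections import defaultdict


def group_words_into_labels(data, min_conf=0.0):
    """Merge word-level OCR rows into one entry per text line.

    `data` is expected to have the shape returned by
    pytesseract.image_to_data(..., output_type=Output.DICT): parallel lists
    keyed by 'text', 'conf', 'left', 'top', 'width', 'height',
    'block_num', 'par_num' and 'line_num'.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # conf is -1 for non-word rows; some versions return it as a string
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(i)

    labels = []
    for key in sorted(lines):
        idx = lines[key]
        left = min(data["left"][i] for i in idx)
        top = min(data["top"][i] for i in idx)
        right = max(data["left"][i] + data["width"][i] for i in idx)
        bottom = max(data["top"][i] + data["height"][i] for i in idx)
        labels.append({
            "text": " ".join(data["text"][i].strip() for i in idx),
            "x": left,
            "y": top,
            "width": right - left,
            "height": bottom - top,
        })
    return labels


def ocr_labels(png_path):
    """Run Tesseract on the labels-only PNG (needs pytesseract and Pillow)."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(png_path), output_type=pytesseract.Output.DICT
    )
    return group_words_into_labels(data)
```

The result of ocr_labels("1.png") is a list of plain dictionaries, so it can be dumped with json.dumps and handed to the HTML/JavaScript side as-is.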
Related
In a project we use GIMP to create banners (which are saved in the GIMP native format). All the work is done by humans, but it is often tedious: it consists of replacing one logo with another logo of the same dimensions, or one piece of text with another piece of text, sometimes hundreds of times.
What is the best way to automatically replace an image (with the same dimensions) or a text in a GIMP file? Does it make more sense to script it within GIMP, or is it better to do open-heart surgery on the file itself without GIMP? Or is there a command-line tool I can use for this?
I'm working on a project in which I want to recognize text from a credit-card-sized document. The document contains details like name, phone number, address, etc. I capture the image and pass it to the Tesseract engine using
text = pytesseract.image_to_string(Image.open(filename), lang = 'eng'). Sometimes I get decent results for each field, but most of the time the result is very bad. How do I resolve this issue? What are the best practices? How do document readers work with OCR? Is it possible to do region-based OCR on the document?
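On the region-based part of the question, one possible sketch is to crop each field and OCR the crops separately. It assumes the card has a fixed layout and is photographed roughly straight. FIELD_REGIONS, region_box and ocr_fields are hypothetical names, and the fractions are invented placeholders that would have to be measured on a real card.

```python
def region_box(image_size, frac_box):
    """Convert a fractional (left, top, right, bottom) box to pixel coordinates."""
    width, height = image_size
    left, top, right, bottom = frac_box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Hypothetical layout: where each field sits, as fractions of the card size.
FIELD_REGIONS = {
    "name": (0.05, 0.10, 0.95, 0.30),
    "phone": (0.05, 0.35, 0.60, 0.50),
    "address": (0.05, 0.55, 0.95, 0.90),
}


def ocr_fields(filename, regions=FIELD_REGIONS):
    """OCR each region on its own (needs pytesseract and Pillow)."""
    import pytesseract
    from PIL import Image

    image = Image.open(filename).convert("L")  # grayscale
    results = {}
    for field, frac_box in regions.items():
        crop = image.crop(region_box(image.size, frac_box))
        # --psm 6 tells Tesseract to treat the crop as one uniform block of text
        text = pytesseract.image_to_string(crop, lang="eng", config="--psm 6")
        results[field] = text.strip()
    return results
```

Cropping first keeps unrelated graphics and neighbouring fields out of each OCR call, which is often where the bad results on whole-card images come from.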
A single approach can't read every kind of text. You have to apply different approaches for different types of PDF.
If the text is not horizontal, you have to rotate it. If the text is curved, you have to apply a transformation (e.g. hog transform).
Moreover, to read text using the package, the text should be clear and horizontal. Otherwise you need to create rules and transform it.
I'm working on a college project that involves OCRing a certain digit code (with a few other characters as separators, mainly '.', '/', etc.).
That digit code (printed on products, for example) is usually in "digital" fonts (e.g. a 7-segment-like font, a pixelated font, etc.).
So I am trying to train Tesseract on several digital fonts I've found online, similar to those used with these codes.
The thing is, that Tesseract recognizes the tiff files I provide it as blank pages.
Things I've tried:
1. Creating a .box file using JTesseract & qt-box (and adjusting the boxes manually): in this case, the box & tiff are read by Tesseract and I'm getting the output "1 Page", but no characters are recognized and the tr file is blank.
2. Creating a .box file with Tesseract's makebox: in this case no boxes are created at all.
PS - I manage to train it just fine using more traditional fonts (Arial for example)
Any ideas?
I'm attaching an image of an example of such a font.
Thank you!
I managed to work around most of the issues. Posting it in case it could help anyone else:
I did 2 steps to get Tesseract to identify my text:
1. Image processing on the training images: I applied some image-processing methods (mainly dilate, erode and some blur) to sort of "connect" the pixels in the text that were segmented or separated from one another. It's VERY IMPORTANT to apply exactly the same steps to the images that are fed to the OCR.
2. I noticed that simply saving images as TIFF/PNG via code doesn't save the DPI setting in the header for some reason (and Tesseract identified them as 0 DPI). I assume there's a way to do that in code, but I didn't have time, so I just opened the files in Photoshop and saved them from there.
I'm not entirely sure whether it was step 1, step 2, or both that solved my issue, but most characters were eventually identified.
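Both steps can be sketched in code. The following is only an illustration, not the exact processing used above: it shows dilate-then-erode (a morphological closing) on a tiny binary grid in pure Python to make the "connect the pixels" idea concrete, and it writes DPI metadata through Pillow's save(..., dpi=...) argument. For real images one would more likely use the equivalent OpenCV or Pillow filters. The function names are made up for this example.

```python
def _morph(grid, pick):
    """Apply a 3x3 min/max filter to a 2D list of 0/1 pixels (1 = ink)."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for y in range(rows):
        row = []
        for x in range(cols):
            neighbours = [
                grid[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
            ]
            row.append(pick(neighbours))
        out.append(row)
    return out


def dilate(grid):
    """Grow ink: a pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    return _morph(grid, max)


def erode(grid):
    """Shrink ink: a pixel stays 1 only if its whole 3x3 neighbourhood is 1."""
    return _morph(grid, min)


def close_gaps(grid):
    """Dilate, then erode: bridges small gaps between segments of a glyph."""
    return erode(dilate(grid))


def save_with_dpi(image, path, dpi=300):
    """Save a Pillow image with explicit DPI metadata (works for PNG and TIFF)."""
    image.save(path, dpi=(dpi, dpi))
```

As the answer stresses, whatever preprocessing is chosen has to be applied identically to the training images and to the images passed to the OCR later.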
I'm looking for a way to locate known text within an image.
Specifically, I'm trying to create a tool to convert a set of scanned pages into PDFs that support searching and copy+paste. I understand how this is usually done: OCR the page, retaining the position of the text, and then add the text as an invisible layer to the PDF. Acrobat has this functionality built in, and tesseract can output hOCR files (containing the recognized text along with its location), which can be used by hocr2pdf to generate a text layer.
Unfortunately, my source images are rather low quality (at most 150 DPI, with plenty of JPEG artifacts and non-solid backgrounds behind some of the text), leading to pretty poor OCR results. However, I do have a copy of the text (sans pictures and layout) that appears on each page.
Matching already known text to its location on a scanned page seems like it would be much easier to do accurately, but I failed to discover any software with this capability built in. How can I leverage existing software to do this?
Edit: The text varies in size and font, though passages of it are consistent.
The thought that springs to mind for me would be a cross-correlation. So, I would take the list of words that you know occur on the page and render them one at a time onto a canvas to create a picture of that word. You would need to use a similar font and size as the words in the document - which is what I asked in my comment. Then I would run a normalised cross-correlation of the picture of the word against the scanned image to see where it occurs. I would do all that with ImageMagick which is available for Windows and OSX (use homebrew on OS X) and included in most Linux distros.
So, let's take a screengrab of the second paragraph of your question and look for the word pretty - where you mention pretty poor OCR.
First, you need to render the word pretty onto a white background. The command will be something like this:
convert -background white -fill black -font Times -pointsize 14 label:pretty word.png
Result:
Then perform a normalised cross-correlation using Fred Weinhaus's script from here like this:
normcrosscorr -p word.png scan.png correlation-result.png
Match Coords: (504,30) And Score In Range 0 to 1: (0.999803)
and you can see the coordinates of the match are 504,30.
Result:
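To make clear what the correlation step computes, here is a small pure-Python sketch of zero-mean normalised cross-correlation on 2D lists of grayscale values. It is not Fred Weinhaus's script and it is far too slow for full page scans; it only illustrates the technique, and the function name is made up. Note that this variant scores in the range -1 to 1.

```python
import math


def normalized_cross_correlation(image, template):
    """Return ((x, y), score) of the best match of `template` inside `image`.

    Both arguments are 2D lists of grayscale values. The score is the
    zero-mean normalised cross-correlation, in the range -1 to 1.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(d * d for d in t_dev))

    best_pos, best_score = None, -2.0
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            # Pixels of the image window currently under the template
            w_vals = [image[y + j][x + i] for j in range(th) for i in range(tw)]
            w_mean = sum(w_vals) / len(w_vals)
            w_dev = [v - w_mean for v in w_vals]
            w_norm = math.sqrt(sum(d * d for d in w_dev))
            if t_norm == 0 or w_norm == 0:
                continue  # flat patch: correlation is undefined
            score = sum(a * b for a, b in zip(t_dev, w_dev)) / (t_norm * w_norm)
            if score > best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score
```

Subtracting the means and dividing by the norms is what makes the score insensitive to overall brightness and contrast differences between the rendered word and the scan.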
Another Idea
Another idea might be to take Google's Tesseract-OCR and replace the standard dictionary with the text file containing the words on the page you are processing...
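A hedged sketch of a related, lighter-weight variant: rather than replacing the dictionary, the Tesseract command line accepts a --user-words file that supplements it. How much effect this has depends on the Tesseract version and engine mode, so treat it as an experiment. The helper names are made up, and the word-splitting regex is an assumption about what counts as a word.

```python
import re


def write_user_words(page_text, path):
    """Write the unique words of `page_text` to `path`, one per line."""
    # Letters only, allowing inner apostrophes and hyphens (e.g. "don't")
    pattern = r"[^\W\d_]+(?:['-][^\W\d_]+)*"
    words = sorted(set(re.findall(pattern, page_text)))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")
    return words


def ocr_with_word_list(image_path, words_path):
    """OCR an image with an extra word list (needs pytesseract and Pillow).

    The path is passed unquoted, so it must not contain spaces.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), config=f"--user-words {words_path}"
    )
```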
Greetings Overflowers,
Given that:
I have images of documents with text of mixed languages
I need this text to be highlightable (word by word) by end users
I have this text in plain digital format already
I will help my program to figure out where words are
I do not want my help to be tedious to me
I will also manually fix small inaccuracies after my program
What is the easiest help I can give my program so that it can draw rectangles around selected words? What algorithm would you use for this program? I tried OCR software like OmniPage Pro, but it does not provide this functionality.
Regards
I implemented word bounding boxes and word highlighting in my application some years ago. You said "I have this text in plain digital format". One key component is to have the coordinates of characters or words in order to map them to the proper image areas. It works like a searchable PDF: when you select text, it is internally mapped to the image layer, and conversely, a selection on the image selects the matching text. But even from a PDF those coordinates cannot be exported, I believe. If no such coordinate information currently exists in your text, the easiest option is probably to re-OCR the images with a high-quality engine that can produce coordinates as part of its output. If you were to use WiseTREND OCR Cloud 2.0, the XML output would contain all that detailed metadata. If coordinate information exists, then all the major components are there and what remains is just the work of designing an efficient UI.
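Once an OCR pass has produced words with coordinates, the already known text can be mapped onto them. A minimal sketch using difflib from the Python standard library follows. It assumes the OCR words arrive in reading order; align_known_text is a hypothetical helper, not part of any OCR product, and the box format is whatever the engine returned.

```python
from difflib import SequenceMatcher


def align_known_text(known_words, ocr_words):
    """Assign OCR bounding boxes to the words of an already known text.

    `ocr_words` is a list of (text, box) pairs in reading order. Returns a
    list of (known_word, box_or_None) pairs, one per known word.
    """
    ocr_texts = [text.lower() for text, _ in ocr_words]
    known_lower = [word.lower() for word in known_words]
    matcher = SequenceMatcher(None, known_lower, ocr_texts, autojunk=False)

    boxes = [None] * len(known_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal': identical words. A 'replace' of equal length is a run of
        # misread words between correctly read neighbours, so reuse its boxes.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for offset in range(i2 - i1):
                boxes[i1 + offset] = ocr_words[j1 + offset][1]
    return list(zip(known_words, boxes))
```

Words that come back with None are the ones left for manual fixing, which fits the "fix small inaccuracies after my program" workflow described in the question.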