Best practices for tesseract ocr using python - machine-learning

I'm working on a project in which I want to recognize text from a credit-card-sized document. The document contains details such as name, phone number, address, etc. I capture the image and pass it to the Tesseract engine using
text = pytesseract.image_to_string(Image.open(filename), lang='eng'). Sometimes I get decent results for each field, but most of the time the result is very bad. How do I resolve this issue? What are the best practices? How do document readers work with OCR? Is it possible to do region-based OCR on the document?

No single approach can read every kind of text. You have to apply different approaches for different types of PDF.
If the text is not horizontal, you have to rotate it. If the text is curved, you have to apply a transformation (e.g. hog transform).
Moreover, to read text using the package, the text should be clear and horizontal. Otherwise you need to create rules and transform it.
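On the region-based part of the question: one way to do it is to crop each field out of the card and run Tesseract on the crops separately. Below is a minimal sketch, assuming the card has already been rotated and cropped so that it fills the image. The FIELD_BOXES layout, the field names and the helper names are hypothetical; you would have to measure the boxes on your own document template.

```python
from typing import Dict, Tuple

# Hypothetical field layout: each box is (left, top, right, bottom) expressed
# as fractions of the card width/height. Measure these on your own template.
FIELD_BOXES: Dict[str, Tuple[float, float, float, float]] = {
    "name": (0.05, 0.20, 0.95, 0.35),
    "phone": (0.05, 0.40, 0.60, 0.52),
    "address": (0.05, 0.55, 0.95, 0.90),
}


def to_pixel_box(rel_box, width, height):
    """Convert a fractional box to integer pixel coordinates for Image.crop."""
    left, top, right, bottom = rel_box
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def ocr_fields(filename, field_boxes=FIELD_BOXES):
    """Crop each field region and OCR it on its own."""
    # Imported here so that to_pixel_box works without Tesseract installed.
    from PIL import Image
    import pytesseract

    image = Image.open(filename).convert("L")  # grayscale
    results = {}
    for field, rel_box in field_boxes.items():
        region = image.crop(to_pixel_box(rel_box, *image.size))
        # --psm 6 asks Tesseract to treat the crop as one uniform block of text.
        text = pytesseract.image_to_string(region, lang="eng", config="--psm 6")
        results[field] = text.strip()
    return results
```

Calling ocr_fields("card.jpg") would return a dictionary with one string per field. Fractional boxes are used so the same layout works at any capture resolution.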

Related

How can I improve Tesseract results quality?

I'm trying to read the NIRPP number (social security number) from a French Vitale card using Tesseract OCR (I'm using TesseractOCRiOS 4.0.0). Here is what I'm doing:
First, I request a picture of the whole card.
Then, using a custom cropper, I ask the user to zoom in specifically on the card number.
Then I take this image (1291x202px) and try to read the number using Tesseract:
let tesseract = G8Tesseract(language: "eng")
tesseract?.image = pickedImage
tesseract?.recognize()
print("\(tesseract?.recognizedText ?? "")")
But I'm getting pretty bad results... only like 30% of the time Tesseract is able to find the right number, and among these sometimes I need to trim some characters (like alpha characters, dots, dashes...).
So is there a solution for me to improve these results?
Thanks for your help.
To improve your results:
Zoom your image to an appropriate level. The right amount of zoom will improve your accuracy a lot.
Configure Tesseract so that only digits are whitelisted. I am assuming here that what you are trying to extract contains only digits. If you whitelist only digits, it improves your chances of recognizing 0 as 0 and not as the letter O.
If your extracted text matches a regex, you should configure Tesseract to use that regex as well.
Preprocess your image to remove any background colors and apply morphological operations such as erode to increase the space between your characters/digits. If they are too close together, Tesseract will have a hard time recognizing them correctly. Most image processing libraries come with these operations built in.
Use TIFF as the image format.
Once you have the right preprocessing pipeline and configuration for Tesseract, you will usually get very good and consistent results.
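For readers using Python rather than TesseractOCRiOS, the whitelist and regex points could look roughly like the sketch below. It passes the tessedit_char_whitelist variable through pytesseract's config string and checks the result after recognition. The 15-digit pattern and the function names are assumptions for illustration, and whitelist support varies between Tesseract versions and engine modes, so check it against your build.

```python
import re

# Assumption: the target is a run of exactly 15 digits. Adjust to your format.
NUMBER_PATTERN = re.compile(r"\d{15}")

# --psm 7 tells Tesseract the image is a single text line; the -c option
# restricts recognition to the digits 0-9.
DIGITS_CONFIG = "--psm 7 -c tessedit_char_whitelist=0123456789"


def extract_number(raw_text, pattern=NUMBER_PATTERN):
    """Drop every non-digit, then require the remainder to match the pattern."""
    digits_only = re.sub(r"\D", "", raw_text)
    return digits_only if pattern.fullmatch(digits_only) else None


def read_number(filename):
    """OCR a cropped number strip and validate what comes back."""
    # Imported here so extract_number can be used without Tesseract installed.
    from PIL import Image
    import pytesseract

    image = Image.open(filename).convert("L")  # grayscale
    raw = pytesseract.image_to_string(image, config=DIGITS_CONFIG)
    return extract_number(raw)
```

Returning None when the pattern does not match lets the caller ask the user for another picture instead of accepting a partial read.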
There are a couple of things you need to do:
1. Convert the image to black and white or grayscale.
You can use built-in functionality such as the Graphics framework, or a third-party library such as OpenCV or GPUImage, to apply black and white or grayscale.
2. Then apply text detection using the Vision framework.
From the Vision text detection you can crop the text regions according to the detected coordinates.
3. Pass these cropped images (detected text) to TesseractOCRiOS.
I hope it will work for your use case.
Thanks
I have a similar issue. I discovered that Tesseract recognizes text only if the given image contains a region of interest.
I solved the problem using Apple's Vision framework. It has VNDetectTextRectanglesRequest, which returns the CGRect of the detected text in the image. You can then crop the image to the regions where text is present and send them to Tesseract for recognition.
Ray Smith says:
Since HP had independently-developed page layout analysis technology that was used in products, (and therefore not released for open-source) Tesseract never needed its own page layout analysis. Tesseract therefore assumes that its input is a binary image with optional polygonal text regions defined.
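Since the quote says Tesseract assumes a binary input image, it can help to binarize the crop yourself before recognition. Here is a minimal pure-Python sketch of Otsu's thresholding, written in Python rather than Swift to keep it short. The function names are my own, and image libraries such as OpenCV ship a faster built-in version of the same idea.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0-255 grayscale values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold=None):
    """Map values above the threshold to 255 and the rest to 0."""
    pixels = list(pixels)
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [255 if value > threshold else 0 for value in pixels]


def binarize_image(filename):
    """Load an image with Pillow and return a black-and-white copy."""
    from PIL import Image  # imported lazily; only needed for real images

    image = Image.open(filename).convert("L")
    threshold = otsu_threshold(image.tobytes())
    return image.point(lambda value: 255 if value > threshold else 0)
```

A global threshold like this works on evenly lit crops. Photos with shadows or gradients usually need an adaptive (local) threshold instead.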

pytesseract - Read text from images with more accuracy

I am working with pytesseract. I want to read data from a driving-license-style document. At present I convert the .jpg image to binary (grayscale) format using OpenCV, but I am not getting accurate results. How do you solve this? Is there a standard image size?
Localize your detection by setting the rectangles where Tesseract has to look. You can then restrict, per rectangle, the type of data expected at that place, for example numeric or alphabetic. You can also make a dictionary file for Tesseract to improve accuracy (this can be used for detecting the card holder's name by listing common names in a file). If there is noise in the background, design a filter to remove it. Good luck!
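The dictionary-file idea could be sketched as follows, assuming Tesseract's --user-words command-line option is passed through pytesseract's config string. The helper names and the sample names are hypothetical, and whether the engine actually honours the word list depends on the Tesseract version and engine mode, so treat this as something to experiment with.

```python
import os
import tempfile


def write_user_words(words, directory=None):
    """Write one word per line to a text file and return its path."""
    handle, path = tempfile.mkstemp(suffix=".txt", dir=directory, text=True)
    with os.fdopen(handle, "w", encoding="utf-8") as stream:
        for word in words:
            stream.write(word.strip() + "\n")
    return path


def name_field_config(user_words_path):
    """Build a config string for a single-line name field."""
    # --psm 7: single text line; --user-words: extra dictionary file.
    return f'--psm 7 --user-words "{user_words_path}"'


def read_name(name_region, common_names):
    """OCR an already-cropped name region with the extra word list."""
    import pytesseract  # imported lazily; requires Tesseract to be installed

    path = write_user_words(common_names)
    try:
        text = pytesseract.image_to_string(
            name_region, lang="eng", config=name_field_config(path))
    finally:
        os.remove(path)  # the word list is only needed during recognition
    return text.strip()
```

Here name_region is a Pillow image of the rectangle that holds the name, cropped as described in the answer.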

Semi-Automatic Text Highlighting in Images?

Greetings Overflowers,
Given that:
I have images of documents with text of mixed languages
I need this text to be highlightable (word by word) by end users
I have this text in plain digital format already
I will help my program to figure out where words are
I do not want my help to be tedious to me
I will also manually fix small inaccuracies after my program
What is the easiest help I can give my program so that it can draw rectangles around selected words? What algorithm would you use for this program? I tried OCR software such as OmniPage Pro, but it does not provide this functionality.
Regards
I implemented word bounding boxes and word highlighting in my application some years ago. You said "I have this text in plain digital format". One key component is to have the coordinates of characters or words in order to map them to the proper image areas. This is like a searchable PDF: when you select text it is internally mapped to the image layer, and conversely a selection on the image selects the matching text. But even from a PDF those coordinates cannot be exported, I believe. If no such coordinate information currently exists in your text, the easiest option is probably to re-OCR the images with a high-quality engine that can produce coordinates as part of its output. If you were to use WiseTREND OCR Cloud 2.0, the XML output would contain all that detailed metadata. If coordinate information exists, then all the major components are there and the rest is just work on efficient UI design.
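As an illustration of an engine that produces coordinates as part of its output, here is a minimal sketch using pytesseract's image_to_data (a different engine from the one named above). It returns one entry per detected word with left, top, width, height and a confidence value. The word_boxes helper and the min_conf cut-off are my own additions.

```python
def word_boxes(data, min_conf=0.0):
    """Turn an image_to_data dictionary into a list of word rectangles."""
    boxes = []
    for i, text in enumerate(data["text"]):
        word = text.strip()
        conf = float(data["conf"][i])  # non-word rows carry a conf of -1
        if not word or conf < min_conf:
            continue
        boxes.append({
            "text": word,
            "left": data["left"][i],
            "top": data["top"][i],
            "width": data["width"][i],
            "height": data["height"][i],
        })
    return boxes


def ocr_word_boxes(filename, lang="eng"):
    """Run Tesseract and return the rectangles of all recognized words."""
    # Imported here so word_boxes can be used without Tesseract installed.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(filename), lang=lang,
        output_type=pytesseract.Output.DICT)
    return word_boxes(data)
```

The rectangles can then be matched against the text you already have, so the OCR output is only used for positions and your own text stays authoritative.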

Converting labeled images (EPS) to interactive web page using ImageMagick, OCR, JavaScript

Business Insight:
We are in the education domain and we have a requirement to automate the conversion of labeled images (EPS) into interactive exercises (using HTML/SVG/JavaScript) used by students.
Technical Insight:
Layered EPS files are what we get from the publishers. Each EPS file should be converted into two PNG files: [1.png], which has the label texts only, and [2.png], which has everything but the label texts.
Then [1.png] should be run through some advanced OCR (?) program that outputs the label texts along with their positions (X,Y coords) in the image. Then HTML/JavaScript could be used to overlay the label texts on [2.png], along with some interactions like drag'n'drop.
Tried so far:
Manually converted the EPS into PNG and used ImageMagick and Tesseract OCR to get the label text alone.
Question:
How far could the above image processing requirements (EPS->PNG+text labels with coords) be automated, and what are the best tools for it? I appreciate the help in advance.
PS: I'm a UI developer and could handle the HTML/JavaScript part if I am just given the coords for the labels.
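For the "labels with coords" step, one possible sketch is below. It assumes the label-only PNG already exists and is OCRed with pytesseract's image_to_data, then merges the words of each text line into one label with a bounding box and dumps the result as JSON for the JavaScript side. The function names and the JSON field names (text, x, y, width, height) are hypothetical.

```python
import json


def group_labels(data):
    """Merge image_to_data words into one label per detected text line."""
    lines = {}
    for i, text in enumerate(data["text"]):
        word = text.strip()
        if not word:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        left, top = data["left"][i], data["top"][i]
        right, bottom = left + data["width"][i], top + data["height"][i]
        if key not in lines:
            lines[key] = {"words": [word], "box": [left, top, right, bottom]}
        else:
            entry = lines[key]
            entry["words"].append(word)
            box = entry["box"]
            # Grow the line box so that it also covers this word.
            entry["box"] = [min(box[0], left), min(box[1], top),
                            max(box[2], right), max(box[3], bottom)]
    return [{"text": " ".join(entry["words"]),
             "x": entry["box"][0], "y": entry["box"][1],
             "width": entry["box"][2] - entry["box"][0],
             "height": entry["box"][3] - entry["box"][1]}
            for entry in lines.values()]


def labels_to_json(png_path, lang="eng"):
    """OCR the label-only PNG and return a JSON string of positioned labels."""
    # Imported here so group_labels can be used without Tesseract installed.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(png_path), lang=lang,
        output_type=pytesseract.Output.DICT)
    return json.dumps(group_labels(data), indent=2)
```

The coordinates are in pixels of the OCRed PNG, so the overlay has to be scaled by the same factor if [2.png] is displayed at a different size.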

Generate font from an image of text

Is it possible to generate a specific font from the image given below?
My idea is to generate a specific font for the image of text given below by manually selecting portions of the image and mapping them to a set of letters, generating the font from this, and then using this font to make the text readable for an OCR engine. Is generating a font possible using any open-source implementation? Also, please suggest any good OCR engines.
Abbyy FineReader 10 gets better than expected results but predictably gets confused when the characters touch.
Your problem is that the line spacing is too small. The descenders of each line overlap the character bounding boxes of the characters in the line directly below. This makes character segmentation almost impossible because the characters are touching and overlapping. The number of combinations of overlapping characters is virtually impossible to train for. The 'g' and 'y' characters are the worst offenders.
A double line spaced version of this would probably OCR reasonably well.
A custom solution that segmented and separated each line, along with a good dictionary, would definitely improve the results. There would still be some errors to correct manually though. The custom routine would have to deal with the ascenders and descenders and try to segment the image into lines, which can then be fed to a decent OCR engine. One way would be to analyse every character blob on the page and allocate it to a line. Leptonica (www.leptonica.com - C Imaging Library) would probably make this job a little easier.
I would not try this without increasing the resolution to 200 or 300 dpi first.
With this custom solution, training a font becomes an option if the OCR engine does a poor job initially.
Abbyy (www.abbyy.com) or Google Tesseract OCR 3.00 would be a good place to start.
No guarantees as to whether all of this will work though. This is quite a difficult page to OCR, and you need to work out whether it is better to have it typed up manually overseas. It depends on the number of pages you need to process.
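The "allocate every blob to a line" idea could be sketched like this, in plain Python rather than Leptonica. It assumes the character blobs have already been extracted as (x, y, width, height) boxes, for example by connected-component analysis, and it groups them by vertical centre. The tolerance value is a guess you would need to tune, and tall descenders can still end up on the wrong line with a heuristic this simple.

```python
def allocate_to_lines(blobs, tolerance=0.5):
    """Group (x, y, w, h) blobs into text lines by their vertical centres.

    A blob joins the current line when its centre lies within
    tolerance * (mean blob height of that line) of the line's mean centre.
    """
    lines = []  # each entry: {"center": float, "blobs": [boxes]}
    for blob in sorted(blobs, key=lambda b: b[1] + b[3] / 2):
        center = blob[1] + blob[3] / 2
        placed = False
        if lines:
            last = lines[-1]
            members = last["blobs"]
            mean_height = sum(b[3] for b in members) / len(members)
            if abs(center - last["center"]) <= tolerance * mean_height:
                members.append(blob)
                # Update the line centre with the new member included.
                last["center"] = sum(
                    b[1] + b[3] / 2 for b in members) / len(members)
                placed = True
        if not placed:
            lines.append({"center": center, "blobs": [blob]})
    # Within each line, order the blobs from left to right.
    return [sorted(line["blobs"]) for line in lines]
```

Each returned line could then be pasted onto its own strip with extra spacing and passed to the OCR engine line by line.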
