I have to OCR a table from a PDF document. I wrote a simple Python + OpenCV script to extract the individual cells, but then a new problem arose: the text is antialiased and of poor quality.
Tesseract's recognition rate is very low. I tried preprocessing the images with adaptive thresholding, but the results weren't much better.
I also tried a trial version of ABBYY FineReader, and it does give good output, but I don't want to use non-free software.
I wonder whether some preprocessing would solve the issue, or whether I need to learn (or write) another OCR system.
If you look closely at your antialiased text samples, you'll notice that the edges contain a lot of red and blue:
This suggests that the antialiasing is taking place inside your computer, which has used subpixel rendering to optimise the results for your LCD monitor.
If so, it should be quite easy to extract the text at a higher resolution. For example, you can use ImageMagick to extract images from PDF files at 300 dpi by using a command line like the following:
convert -density 300 source.pdf output.png
You could even try loading the PDF in your favourite viewer and copying the text directly to the clipboard.
Addendum:
I tried converting your sample text back into its original pixels and applying the scaling technique mentioned in the comments. Here are the results:
Original image:
After scaling 300% and applying simple threshold:
After smart scaling and thresholding:
As you can see, some of the letters are still a bit malformed, but I think there's a better chance of reading this with Tesseract.
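For reference, the simple "scale, then threshold" step can be sketched in plain Python on a tiny grayscale grid. This is only an illustration, not the code that produced the images above: the function names, the nested-list image format and the cutoff of 128 are my own assumptions. On real images, OpenCV's cv2.resize and cv2.threshold would do the same job.

```python
def upscale_bilinear(img, factor):
    """Enlarge a grayscale image (list of rows, values 0-255) by an integer factor."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * factor):
        # Map the output pixel centre back into source coordinates, clamped to the image.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Interpolate horizontally on both rows, then vertically between them.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def threshold(img, cutoff=128):
    """Binarise: pixels darker than cutoff become 0 (text), the rest 255."""
    return [[0 if v < cutoff else 255 for v in row] for row in img]


# Hypothetical antialiased sample: one dark vertical stroke with grey edges.
sample = [
    [255, 160, 0, 160, 255],
    [255, 160, 0, 160, 255],
]
binary = threshold(upscale_bilinear(sample, 3))
```

Scaling before thresholding matters because the grey edge pixels get spread over several output pixels, so the cutoff can land part-way through an edge instead of keeping or dropping it whole.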
I'm developing an app which can recognize license plates (ANPR). The first step is to extract the license plates from the image. I am using OpenCV to detect the plates based on their width/height ratio, and this works pretty well:
But as you can see, the OCR results are pretty bad.
I am using tesseract in my Objective C (iOS) environment. These are my init variables when starting the engine:
// Init the Tesseract engine; Init() returns 0 on success.
tesseract = new tesseract::TessBaseAPI();
int initRet = tesseract->Init([dataPath cStringUsingEncoding:NSUTF8StringEncoding], [language UTF8String]);
// Restrict recognition to the characters that can appear on a plate.
tesseract->SetVariable("tessedit_char_whitelist", "BCDFGHJKLMNPQRSTVWXYZ0123456789-");
// Set the penalties for words that are not in the dictionary.
tesseract->SetVariable("language_model_penalty_non_freq_dict_word", "1");
tesseract->SetVariable("language_model_penalty_non_dict_word", "1");
// Do not load the system dictionary.
tesseract->SetVariable("load_system_dawg", "0");
How can I improve the results? Do I need to let OpenCV do more image manipulation? Or is there something I can improve with tesseract?
Two things will fix this completely:
1. Remove everything which is not text from the image. You need to use some CV to find the plate area (for example by color, etc.) and then mask out all of the background. You want the input to tesseract to be black and white, where text is black and everything else is white.
2. Remove skew (as mentioned by FrankPI above). tesseract is actually supposed to work okay with skew (see the "Tesseract OCR Engine" overview by R. Smith), but it doesn't always work, especially if you have a single line as opposed to a few paragraphs. So removing skew manually first is always good, if you can do it reliably. You will probably know the exact shape of the bounding trapezoid of the plate from step 1, so this should not be too hard. In the process of removing skew, you can also remove perspective: all license plates (usually) have the same font, and if you scale them to the same (perspective-free) shape, the letter shapes would be exactly the same, which would help text recognition.
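To make the perspective-removal idea concrete, here is a rough sketch in plain Python. It computes the 3x3 perspective transform that maps the four corners of the plate trapezoid onto an upright rectangle. The corner coordinates and the 200x50 target size are made-up values, and the function names are mine. With OpenCV you would normally let cv2.getPerspectiveTransform and cv2.warpPerspective do this instead.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """3x3 perspective transform mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two equations per point pair, from u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1)
        # and the matching expression for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(hm, x, y):
    """Apply the transform to one point (with the perspective divide)."""
    w = hm[2][0] * x + hm[2][1] * y + hm[2][2]
    return ((hm[0][0] * x + hm[0][1] * y + hm[0][2]) / w,
            (hm[1][0] * x + hm[1][1] * y + hm[1][2]) / w)


# Hypothetical plate corners in the photo (clockwise from top-left)
# and the upright rectangle we want them to land on.
plate_corners = [(12, 30), (208, 18), (214, 66), (16, 84)]
target = [(0, 0), (200, 0), (200, 50), (0, 50)]
H = homography(plate_corners, target)
```

To rectify a whole image you would compute the inverse mapping (target to photo) and sample the photo for every pixel of the output rectangle.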
Some further pointers...
Don't try to code this at first: take a picture of a plate that is really easy to OCR (i.e., from directly in front, no perspective), edit it in Photoshop (or GIMP) and run it through tesseract on the command line. Keep editing in different ways until this works. For example: select by color (or flood select the letter shapes), fill with black, invert selection, fill with white, perspective transform so the corners of the plate form a rectangle, etc. Take a bunch of pictures, some harder (maybe from odd angles, etc.). Do this with all of them. Once this works completely, think about how to make a CV algorithm that does the same thing you did in Photoshop :)
P.S. Also, it is better to start with a higher-resolution image if possible. It looks like the text in your example is around 14 pixels tall. tesseract works pretty well with 12-point text at 300 dpi, which is about 50 pixels tall, and it works much better at 600 dpi. Try to make your letters at least 50, preferably 100, pixels tall.
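The numbers above come from the fact that one point is 1/72 inch. A quick sketch of the arithmetic (the function names are just mine, and "height" here means the nominal point size, not the exact height of any one glyph):

```python
def nominal_height_px(point_size, dpi):
    """Nominal text height in pixels: one point is 1/72 inch."""
    return point_size * dpi / 72


def upscale_factor(current_px, target_px):
    """How much to enlarge an image so its text reaches target_px."""
    return target_px / current_px


print(nominal_height_px(12, 300))        # 50.0
print(nominal_height_px(12, 600))        # 100.0
print(round(upscale_factor(14, 50), 2))  # 3.57
```

So text that is 14 pixels tall would need to be enlarged roughly 3.5 times to reach the 50-pixel mark, and about 7 times to reach 100 pixels.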
P.P.S. Are you doing anything to train tesseract? I think you have to do that, the font here is different enough to be a problem. You probably also need something to recognize (and not penalize) dashes which will be very common in your texts, looks like in the second example "T-" is recognized as H.
I don't know tesseract very well, but I have some information about OCR. Here we go.
In an OCR task you need to be sure that your training data contains the same font that you are trying to recognize. If you are trying to recognize multiple fonts, make sure all of those fonts are in your training data to get the best performance.
As far as I know, tesseract can apply OCR in a few different ways. In one, you give it an image which has multiple letters in it and let tesseract do the segmentation. In another, you give tesseract already segmented letters and only expect it to recognize each letter. Maybe you can try switching from the one you are using to the other.
If you are training the recognizer yourself, make sure you have enough samples of each letter in your training data, and roughly the same number of each.
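A quick way to check that last point is to count the labels in your training set. This is a small sketch under my own assumptions: the labels are one character per training sample, and the function name and sample data are made up.

```python
from collections import Counter


def training_coverage(labels, alphabet):
    """Count samples per character, list characters with no samples at all,
    and report the ratio between the most and least frequent present character."""
    counts = Counter(labels)
    missing = sorted(set(alphabet) - set(counts))
    present = [counts[ch] for ch in alphabet if ch in counts]
    ratio = max(present) / min(present) if present else 0.0
    return counts, missing, ratio


# Hypothetical label list: one entry per training sample.
counts, missing, ratio = training_coverage(list("AAB1"), "AB12")
print(missing)  # ['2']
print(ratio)    # 2.0
```

Any character in `missing` cannot be learned at all, and a large `ratio` means some characters are heavily over-represented compared with others.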
Hope this helps.
I've been working on an iOS app. If you need to improve the results, you should train tesseract OCR; this improved my results by 90%. Before training, the OCR results were pretty bad.
So, I used this gist in the past to train tesseract OCR with a licence plate font.
If you are interested, I open-sourced this project some weeks ago on GitHub.
Here is my real world example with trying out OCR from my old power meter. I would like to use your OpenCV code so that OpenCV does automatic cropping of image, and I'll do image cleaning scripts.
First image is original image (cropped power meter numbers)
Second image is slightly cleaned up image in GIMP, around 50% OCR accuracy in tesseract
Third image is completely cleaned image - 100% OCR recognized without any training!
License plates can now be recognized easily with a Core ML model. I have created such a model; you can find it here. You just need to split the plate into characters at 28x28 resolution using the Vision framework and pass each image to VNImageRequestHandler, as shown below:
let handler = VNImageRequestHandler(cgImage: imageUI.cgImage!, options: [:])
You will get the desired results by using my Core ML model. See this link for more detail, but use my model for better results in license plate recognition. I have also created an mlmodel for license plate recognition.
I was watching this talk from PyCon: http://youtu.be/B1d9dpqBDVA?t=15m34s. Around the 15:33 mark, the speaker talks about extracting lines from an image (a receipt) and then feeding them to the OCR engine so that the text can be extracted more reliably.
I have a similar need where I'm passing images to the OCR engine. However, I don't quite understand what he means by extracting lines from an image. What are some open source tools that I can use to extract lines from an image?
Take a look at the technique used to detect the skew angle of a text.
Groups of lines are used to isolate text on an image (this is the interesting part).
From this result you can easily detect the upper/lower limits of each line of text. The text itself will be located inside them. I've faced a similar problem before, the code might be useful to you:
All you need to do from here is crop each pair of lines and feed that as an image to Tesseract.
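If you just want the upper/lower limits of each text line, a simpler alternative to detecting lines is a horizontal projection profile: count the ink pixels in every row and cut wherever a run of inked rows starts or ends. Note that this is a different technique from the one described above, and it assumes the image is already binarized and deskewed. The names and the nested-list image format are my own.

```python
def text_line_bounds(binary, min_ink=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    binary is a list of rows where 0 means ink and 255 means background.
    A row belongs to a text line if it has at least min_ink ink pixels.
    """
    bounds = []
    start = None
    for y, row in enumerate(binary):
        ink = sum(1 for v in row if v == 0)
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a new line
        elif ink < min_ink and start is not None:
            bounds.append((start, y))      # blank row closes the line
            start = None
    if start is not None:
        bounds.append((start, len(binary)))
    return bounds


def crop_lines(binary):
    """Cut the image into one sub-image per detected text line."""
    return [binary[top:bottom] for top, bottom in text_line_bounds(binary)]


# Hypothetical 5-row page: blank, two inked rows, blank, one inked row.
page = [
    [255, 255, 255],
    [255, 0, 255],
    [0, 0, 255],
    [255, 255, 255],
    [255, 0, 0],
]
lines = crop_lines(page)
```

Each element of `lines` can then be saved as its own image and passed to Tesseract.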
I can suggest a simple technique for feeding images to OCR: perform some operations to get the ROI (region of interest) of your image, and localize the text area after binarizing it. Then you can find contours and, by tuning the threshold value and setting the required contour area, feed the resulting image to OCR :)
(Sorry for the poor explanation.)
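As a rough illustration of the "find blobs, filter by area" idea, here is a plain-Python sketch. It uses 4-connected component labelling by flood fill rather than true contour tracing, so treat it as a stand-in for what an image library's contour functions would give you. The function name, the `min_area` value and the sample image are assumptions of mine.

```python
def component_boxes(binary, min_area=3):
    """Bounding boxes (x0, y0, x1, y1), inclusive, of 4-connected ink blobs
    (value 0) that contain at least min_area pixels."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 0 or seen[y][x]:
                continue
            # Flood-fill one blob, collecting its pixel coordinates.
            stack = [(x, y)]
            seen[y][x] = True
            xs, ys = [], []
            while stack:
                cx, cy = stack.pop()
                xs.append(cx)
                ys.append(cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] == 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            # Keep only blobs that are large enough to be a character.
            if len(xs) >= min_area:
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Hypothetical binarized image: one 2x2 blob and one single-pixel speck.
image = [
    [255, 0, 0, 255, 255],
    [255, 0, 0, 255, 0],
    [255, 255, 255, 255, 255],
]
boxes = component_boxes(image)
```

The speck is dropped by the area filter, and the remaining box is the region you would crop and pass to the OCR engine.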
Direct answer: you extract lines from an image with Hough Transform.
You can find an analytical guide here.
Text lines can be detected as well. Karlphillip's answer is based on Hough Transform too.
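For anyone curious what the Hough transform does under the hood, here is a bare-bones sketch of the voting step in plain Python. Every edge point votes for all lines (rho, theta) that pass through it, with rho = x*cos(theta) + y*sin(theta), and the bucket with the most votes is the strongest line. The function name and the one-degree, one-pixel resolution are my own choices; a real implementation would return several peaks, not just one.

```python
import math


def hough_peak(points, theta_steps=180):
    """Return the strongest line as (rho, theta_degrees, votes)."""
    votes = {}
    for x, y in points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            # Distance from the origin to the line through (x, y) with normal angle theta.
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, t)] = votes.get((rho, t), 0) + 1
    (rho, t), count = max(votes.items(), key=lambda kv: kv[1])
    return rho, 180.0 * t / theta_steps, count


# Hypothetical edge points lying on the horizontal line y = 5.
edge_points = [(x, 5) for x in range(100)]
rho, theta_deg, count = hough_peak(edge_points)
```

A horizontal line has its normal pointing straight down, so it shows up at theta = 90 degrees with rho equal to its y coordinate.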
Tesseract works for images that contain nothing but text. But what if an image contains both text and pictures, and we want only the text to be recognized?
I am using Tesseract to recognize text in images. Tesseract returns the exact text for images that contain only text. However, when I tried an image that contains a car and its number plate, Tesseract returned garbled text for the car number. I applied grayscale conversion, thresholding and other effects to increase the accuracy of the output, but it still returns wrong text mixed with different encodings. So I am looking for other ways to extract such text.
Does anyone know how to get the text from such images using Tesseract OCR or any alternative, so that only the text part remains in the image and Tesseract can return the exact text?
Cropping the image is one way to isolate the text, but how can I do that using ImageMagick or any other tool?
Thanks.
If you know exactly where on the image the text is, you can pass the coordinates of those regions to Tesseract along with the image, so that only those regions are recognized. Take a look at the Tesseract API methods TesseractRect and SetRectangle.
I get the data from video, so there is no way for me to rescan the images, but I can scale them if necessary.
I do have only a limited number of characters, 1234567890:, but I have no control over the dpi of the original image or the font.
I tried to train tesseract but without any visible effect, the test project is located at https://github.com/ssbarnea/tesseract-sample but the current results are really bad.
Example of original image being captured:
Example of postprocessed image for OCR:
How can I improve the OCR process in this case?
You can try adding some extra space at the edges of the image; this sometimes helps Tesseract. However, open-source OCR engines are very sensitive to the DPI of the source image.
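Adding the extra space is straightforward. Here is a minimal plain-Python sketch, assuming a grayscale image stored as a list of rows with 255 as the background value; the function name and the pad width are arbitrary. With OpenCV, cv2.copyMakeBorder does the same thing.

```python
def add_border(img, pad, fill=255):
    """Surround a grayscale image (list of rows) with pad pixels of fill."""
    width = len(img[0]) + 2 * pad
    blank = [fill] * width
    padded = [blank[:] for _ in range(pad)]          # top margin
    for row in img:
        padded.append([fill] * pad + list(row) + [fill] * pad)
    padded.extend(blank[:] for _ in range(pad))      # bottom margin
    return padded


# Hypothetical 2x2 all-ink image, padded by one pixel on every side.
padded = add_border([[0, 0], [0, 0]], 1)
```

Use the same fill value as the page background, so that the border does not introduce a new edge for the OCR engine to trip over.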