Improve OCR accuracy from scanned documents - image-processing

I'm scanning a lot of A3 documents with a standard Brother A3 multifunction scanner and then using FineReader Pro to OCR the images.
However, I'm getting a lot of errors in the characters recognized, and lots of non-alphanumeric strange characters.
Can someone give me any tips for programmatically improving the OCR accuracy, either pre-processing on the scanned images, or post-processing on the recognized text?
Edit: see the attached sample PDF; it includes some of the images from which I get the poorest results.

Do you have a sample image you can post somewhere? Then we can quickly tell you what is causing most of your problems. FineReader is one of the better OCR engines out there, so there are definitely reasons why you are getting poor results.
It could be related to poor contrast and threshold settings, image skew, dirty rollers in the scanner, complex and coloured backgrounds, dithered backgrounds, font sizes that are too small, scanning DPI that is too low, etc.
After seeing the attached image there are a few small issues:
1. There are lots of dirty specks on the background page. FineReader seems to do a reasonable job with this on your images.
2. There is some slight skew, but that is not causing any problems.
3. FineReader is getting confused by the bold, tall Arial-type font used for the column headers.
4. A big problem seems to be the bottom region of the pages, where the contrast is poor and the image is fuzzy. This seems to be a problem with the scanner but could be due to printing problems.
The printing is quite poor and I am guessing it is a scan from a newspaper. Most of your errors are due to scanning issues so it would be hard to programmatically improve the results.
Firstly, I would try scanning the image in grayscale using a slightly higher resolution and see if that helps. FineReader works well with grayscale images. If you have to have a B/W image then see if the scanner driver includes a setting for dynamic thresholding and turn it on.
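If the driver has no such setting, a similar effect can be approximated in software after scanning. A minimal pre-processing sketch, assuming OpenCV (cv2) and a placeholder file name:

import cv2

img = cv2.imread("scan_page1.png")                    # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 3)                        # light denoise for the background specks

# Adaptive ("dynamic") threshold: each pixel is compared against its local
# neighbourhood, which copes with uneven contrast better than one global cut-off.
bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                           cv2.THRESH_BINARY, 31, 15)

cv2.imwrite("scan_page1_bw.png", bw)                  # feed this file to FineReader

The 31-pixel window and the offset of 15 are guesses and would need tuning against the real scans.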
Your images would not be an easy task for any OCR engine. You will get better results if you can improve the scanning. Page 3 has a lot of noise in the bottom right corner.
What version of FineReader are you using? FR10 would probably give better results than previous versions.

Related

image similarity algorithm for charts

For automating the tests of a legacy Windows application, I need to compare screenshots of charts.
Pixel comparison works fine as long as the Windows session opens with same resolution, DPI, color depth, font size, font family, etc., otherwise the screenshot taken during the test may differ slightly from that recorded during the development of the test.
Therefore, I am looking for a method that allows slight variations and produces a score rather than a boolean.
I started by scaling the retrieved screenshot to match the size of the recorded one. Of course, pixel comparison still fails.
Then I tried to use SSIM to get a similarity score (using https://github.com/rhys-e/structural-similarity). It definitely does not work for my case -- see the simplified experiment below.
Any suggestions?
Thanks in advance,
Adrian.
SSIM experiments
This is the reference picture:
This one contains a black line slightly above the reference --> getting 0.9447093986742424
This one is completely different --> getting 0.9516260505445076
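For reference, roughly the same check in Python gives the same kind of score; this is just a sketch assuming Pillow and scikit-image, with placeholder file names:

from PIL import Image
import numpy as np
from skimage.metrics import structural_similarity

ref = np.asarray(Image.open("reference.png").convert("L"))
cand = np.asarray(Image.open("candidate.png").convert("L"))

score = structural_similarity(ref, cand, data_range=255)
print(score)   # close to 1.0 even for visually different charts

I suspect the scores come out so close because the charts are mostly white background with thin lines, so almost all local windows are identical in both images.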

Perceptual Image Comparison

I'm trying to do image comparison to detect changes in a video processing application. These are two images that look identical to me, but are different according to both
http://pdiff.sourceforge.net/
and http://www.itec.uni-klu.ac.at/lire/nightly/api/net/semanticmetadata/lire/imageanalysis/LireFeature.html
Can anyone explain the difference? Eventually I need to find a library that can detect differences without producing false positives.
The two images are different.
I used GIMP (open source) to stack the two images one on top of the other and set the top layer's mode to Difference. It showed a very faint black image, i.e. very little difference. I then used Curves to raise the tones, and it revealed what seem to be JPEG artifacts, even though the files given are PNG. I recommend GIMP, and sometimes I use it instead of Photoshop.
Using GIMP to do a blink comparison between layers at 400% view, I would guess that the first image is closer to the original. The second may be a saved copy of the first, or derived from the original but saved at a lower quality setting.
It seems that the metadata has been stripped off both images (haven't done a definitive look), so no clues there.
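If you want to script the same difference-plus-curves check instead of doing it in GIMP, here is a rough equivalent, assuming Python with Pillow and NumPy (file names are placeholders):

import numpy as np
from PIL import Image

a = np.asarray(Image.open("image_a.png").convert("RGB"), dtype=np.int16)
b = np.asarray(Image.open("image_b.png").convert("RGB"), dtype=np.int16)

diff = np.abs(a - b)                                  # the "difference layer"
print("max per-channel difference:", int(diff.max()))

# Amplify the faint differences (the "curves" step) so the compression
# artifacts become visible to the eye.
amplified = np.clip(diff * 16, 0, 255).astype(np.uint8)
Image.fromarray(amplified).save("difference_x16.png")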
There was a program called Unique Filer that I used for years. It is tunable and rather good. But any comparator is likely to generate a number of false positives if you tune it well enough to make sure it doesn't miss duplicates. If you only want to catch images that are very similar like this pair, then you can tune it very tightly. It is old and may not work on Windows 7 or later.
I would like to find good image checkers / comparators too. I've considered writing my own program.

How can I tell Tesseract that my font has a particular size?

I have a collection of type-written image captions which look like this:
I know that the typewriter is consistent and monospace, with characters measuring 14x22px (as measured from the top of a capital letter to the bottom of a descender).
Tesseract is producing output like this:
The results are mostly good when Tesseract has detected the correct bounding boxes for the letters. But there are many strings of letters which are clumped together (e.g. "Ea", "tree", "fr" and "om" on the first line). These are always transcribed incorrectly and account for the majority of errors.
This is frustrating because I know a priori that all the characters are of a particular size. Is it possible to pass this knowledge on to the tesseract command-line tool?
My command to generate the box file is:
tesseract foo.jpg foo batch.nochop makebox
If possible, I'd prefer to avoid training Tesseract on the font—I don't have any manually transcribed samples, so building a corpus of training data would require some effort.
I'm not sure that Tesseract throws connected characters completely off as Noremac said.
Actually I think that it includes a chopping of joined characters whenever the result of a word detection is unsatisfactory, as explained in section 4.1 of An Overview of the Tesseract OCR Engine.
And I also think that once it finds a fixed pitch text, it should automatically chop the text, even if the characters are connected (look at figure 2 of the same paper).
I know that it's a little bit late to add this answer, but maybe it will help some future visitors!
The issue isn't the font size as much as it is the letters connecting. If you zoom in on the above images with a program that shows the actual pixels (rather than blurring them together), you can see that those groupings of two characters are actually connected. Tesseract OCR is completely based on connected components, so if the characters are connected at all it throws the recognition completely off. I see a couple of options:
If possible, give it a higher resolution image where there is more separation between the characters
Adjust the preprocessing to do a more strict threshold.
I noticed that the pixel connecting the E and the a in the first occurrence is lighter, so adjusting the threshold will remove that connection. However, this could affect more than what you want, such as disjointing characters where you don't expect it.
For updating the thresholding consider this: https://groups.google.com/forum/#!topic/tesseract-ocr/JRwIz3xL45U
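A minimal sketch of both suggestions combined (upscale, then a stricter global threshold), assuming OpenCV; the 3x factor and the cut-off of 160 are guesses to tune against your captions:

import cv2

gray = cv2.imread("foo.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Upscale so the single pixels joining characters are easier to break.
big = cv2.resize(gray, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)

# 2. Stricter threshold: the faint pixel connecting the "E" and the "a"
#    should fall below the cut-off and drop out.
_, bw = cv2.threshold(big, 160, 255, cv2.THRESH_BINARY)

cv2.imwrite("foo_prepped.png", bw)
# then: tesseract foo_prepped.png foo batch.nochop makebox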

Generate font from an image of text

Is it possible to generate a specific font from the image of text given below? My idea is to manually select portions of the image, map them to a set of letters, generate a font from that, and then use this font to make the image readable for an OCR. Is generating a font like this possible with any open-source implementation? Also, please suggest any good OCRs.
Abbyy FineReader 10 gets better than expected results but predictably gets confused when the characters touch.
Your problem is that the line spacing is too small. The descenders of each line overlap the character bounding boxes of the characters in the line directly below. This makes character segmentation almost impossible because the characters are touching and overlapping. The number of combinations of overlapping characters is virtually impossible to train for. The 'g' and 'y' characters are the worst offenders.
A double line spaced version of this would probably OCR reasonably well.
A custom solution that segmented and separated each line, combined with a good dictionary, would definitely improve the results. There would still be some errors to correct manually though. The custom routine would have to deal with the ascenders and descenders and try to segment the image into lines, which can then be fed to a decent OCR engine. One way would be to analyse every character blob on the page and allocate it to a line. Leptonica (www.leptonica.com - C imaging library) would probably make this job a little easier.
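As a starting point, here is a crude sketch of the line-splitting step using a horizontal projection profile, assuming OpenCV and NumPy rather than Leptonica; it only finds bands of ink, so the blob-by-blob allocation described above is still needed wherever descenders touch the line below:

import cv2
import numpy as np

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

row_ink = bw.sum(axis=1) / 255                        # ink pixels per row
text_rows = np.flatnonzero(row_ink > 2)               # assumed noise floor of 2 px

if text_rows.size:
    gaps = np.where(np.diff(text_rows) > 1)[0]
    starts = np.concatenate(([text_rows[0]], text_rows[gaps + 1]))
    ends = np.concatenate((text_rows[gaps], [text_rows[-1]]))
    for i, (top, bottom) in enumerate(zip(starts, ends)):
        cv2.imwrite("line_%02d.png" % i, gray[top:bottom + 1, :])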
I would not try this without increasing the resolution to 200 or 300 dpi first.
With this custom solution, training a font becomes an option if the OCR engine does a poor job initially.
Abbyy (www.abbyy.com) or Google Tesseract OCR 3.00 would be a good place to start.
No guarantees as to whether all of this will work though. This is quite a difficult page to OCR, and you need to work out whether it is better to have it typed up manually overseas. It depends on the number of pages you need to process.

How do you scale an image for print without degrading the quality?

I was wondering how you would print an image scaled to three times its original size without making it look like crap. If you just change the DPI to 300 and print, it'll look like crap. Is there a way to convert it gracefully?
You may have the problem of trying to add detail that isn't there. Hopefully you're aware of this.
The best way to enlarge an image that I know of is to use bicubic interpolation. If it's any help, Photoshop recommends using 'bicubic smoother' for enlargement.
Also, be careful with DPI vs PPI.
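A minimal sketch of the enlargement, assuming Pillow; the 3x factor matches the question and bicubic is the filter recommended above:

from PIL import Image

img = Image.open("original.png")                      # placeholder file name
w, h = img.size
big = img.resize((w * 3, h * 3), resample=Image.BICUBIC)
big.save("enlarged_3x.png", dpi=(300, 300))           # embed the intended print DPI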
This is called supersampling or interpolation. There's no 'perfect' algorithm, since that would imply generating new information where there was none ('between' the pixels); but some methods are better than others in fooling the eye/brain to fill the voids, or at least not making big square boxes.
Start with the Wikipedia articles on nearest-neighbour, bilinear and bicubic interpolation (the three offered by Photoshop). A few more, such as tricubic interpolation and Lanczos resampling, could be of interest; also check the theory and comparison links.
In short, this isn't a clear-cut issue, but an active research field, full of subjectivity and practical trade-offs.
You should vectorize your image, scale it, and if you wish you may convert it back to the original format (jpg, gif, png...). However this works best for simple images.
Do you know how to vectorize? There are some sites that do it online, just do some Google research and you'll find some.
Changing the DPI won't matter if you don't have enough pixels in your image for the size you are printing. In the biz it's called GIGO (Garbage In, Garbage Out).
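For example, a 900 x 600 px image printed at 300 DPI covers only 3 x 2 inches; stretch it to three times that size and the effective resolution drops to 100 DPI, which is where the blockiness comes from.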
If your image is in HTML then create a media="print" stylesheet and feed a high-res image that way.

Resources