Need advice on training Tesseract OCR (text with conversion/compression artifacts)

I need to do OCR on images that have gone through a digital-to-analog (interlaced video) to digital conversion and were then JPEG-compressed (resulting in compression artifacts). I have not been able to locate the exact fonts used, but we'll be looking at a mix of sans-serif fonts; e.g., Arial, Calibri, and Tiresias might work well as a training set. There is no way to get around the JPEG compression. These are text-only, white-on-black images at standard-definition resolution (720x480, deinterlaced).
An example is located here, resized at 1000%:
I've found a preprocessing pipeline that works fairly well for Tesseract (a rough code sketch follows the list):
Resize to 400-600%
Blur
Threshold (binarization)
Erode (to thin the stroke width)
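In OpenCV (Python), a minimal sketch of that pipeline might look like the following; the file name, scale factor, kernel sizes, and use of Otsu thresholding are all assumptions to be tuned against your own frames:

import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # captured frame (placeholder name)

# 1. Resize to roughly 400-600% (500% here) so Tesseract sees larger glyphs
img = cv2.resize(img, None, fx=5, fy=5, interpolation=cv2.INTER_CUBIC)

# 2. Blur to soften ringing/blocking artifacts before binarization
img = cv2.GaussianBlur(img, (5, 5), 0)

# 3. Threshold; Otsu picks the cut-off automatically on bimodal text images
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# 4. Erode to thin the strokes (text is white on black, so erosion thins it)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
img = cv2.erode(img, kernel, iterations=1)

cv2.imwrite("preprocessed.png", img)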
One problem is that letters like 't' and 'f' end up with a diamond shape at the crossbar. Still, this process works fairly well, but it isn't quite perfect. So I'd like to train Tesseract. My questions:
How should I create the training set?
Should I try to emulate the digital-to-analog-to-digital conversion by adding a small amount of noise and then compressing with JPEG? Should I preprocess my training set in the same way as listed above? If I train on noisy, JPEG-compressed images to match my captured images, is it best to skip preprocessing on the captured images?
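For reference, here is a rough sketch of what I mean by degrading clean training renders; the blur size, noise level, and JPEG quality are guesses that would have to be matched against the actual captures:

import cv2
import numpy as np

clean = cv2.imread("train_render.png", cv2.IMREAD_GRAYSCALE)  # clean rendered text (placeholder)

# Slight blur to mimic the softness of the analog (interlaced) pass
soft = cv2.GaussianBlur(clean, (3, 3), 0)

# Add low-amplitude Gaussian noise
noise = np.random.normal(0, 8, soft.shape).astype(np.float32)
noisy = np.clip(soft.astype(np.float32) + noise, 0, 255).astype(np.uint8)

# Re-encode as JPEG at a quality chosen to match the capture chain's artifacts
ok, buf = cv2.imencode(".jpg", noisy, [cv2.IMWRITE_JPEG_QUALITY, 60])
degraded = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)

cv2.imwrite("train_degraded.png", degraded)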
Additionally, any hints on getting rid of the conversion/compression artifacts without sacrificing the text would be appreciated.

Related

Why doesn't image compression use overlapped data?

For example, in audio codecs like Opus, the MDCT is used with 50% overlap to avoid ringing artifacts. Why isn't a similar approach used in image codecs? JPEG, for example, uses non-overlapping 8x8 blocks.
Later lossy image codecs like JPEG2000 do use overlapped transforms, but these techniques just weren't around when JPEG was being defined. The wavelet transform that JPEG2000 is based on hadn't been invented yet, and time-domain aliasing cancellation techniques like the MDCT were extremely new.
As for the MDCT in particular, as far as I know it is not used for image compression at all, even today. I would guess that's because its basis vectors are asymmetric, which makes it an unintuitive choice for imaging applications.

JPG compression and noise

I am studying JPEG compression, and it seems to work by reducing high-frequency components in images. Since noise is usually high-frequency, does this imply that JPEG compression also works, to some extent, as noise reduction?
JPEG compression can reduce noise by smoothing out the high-frequency components of the image, but it also introduces visual noise in the form of compression artifacts. Here is a zoomed-in (3x) view of part of my avatar (a high-quality JPEG) and part of your avatar (a PNG drawing), on the left as downloaded and on the right as compressed with ImageMagick using -quality 60. To my eye they both look "noisier" when JPEG-compressed.
Strictly speaking, no.
JPEG does remove high frequencies (see below), but not selectively enough to be a denoising algorithm. In other words, it will remove high frequencies if they are noise, but also if they are useful detail information.
To understand this, it helps to know the basics of how JPEG works. First, the image is divided into 8x8 blocks. Then the discrete cosine transform (DCT) is applied, so that each element of the 8x8 block contains the "weight" of a different frequency. Then the coefficients are quantized in a fixed way that depends on the quality level selected a priori. This quantization trades precision for coding performance. The amount of precision lost is fixed a priori and, as I said above, it does not differentiate between noise and useful detail.
You can test this yourself by saving the same image at different quality settings (which technically control the amount of quantization applied to each block) and observing that it is not only noise that is removed. There is a nice video showing this effect for different quality levels here: https://upload.wikimedia.org/wikipedia/commons/f/f3/Continuously_varied_JPEG_compression_for_an_abdominal_CT_scan_-_1471-2342-12-24-S1.ogv.
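A small numerical sketch of that mechanism (not a bit-exact JPEG encoder, just an illustration of a DCT followed by fixed quantization on one block; the synthetic block and the use of the standard luminance table are assumptions):

import cv2
import numpy as np

# Standard JPEG luminance quantization table (roughly quality 50)
Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99]], dtype=np.float32)

# A synthetic 8x8 block: a sharp vertical edge plus a little noise ("detail")
block = np.full((8, 8), 40.0, dtype=np.float32)
block[:, 4:] = 200.0
block += np.random.normal(0, 5, (8, 8)).astype(np.float32)

coeffs = cv2.dct(block - 128.0)         # forward DCT on the level-shifted block
quantized = np.round(coeffs / Q) * Q    # quantization: this is where precision is lost
restored = cv2.idct(quantized) + 128.0  # inverse DCT

# Both the noise and some of the edge detail are altered by quantization
print(np.abs(block - restored).max())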

OCR on antialiased text

I have to OCR a table from a PDF document. I wrote a simple Python+opencv script to get the individual cells, but then a new problem arose: the text is antialiased and not of good quality.
Tesseract's recognition rate is very low. I've tried preprocessing the images with adaptive thresholding, but the results weren't much better.
I've tried the trial version of ABBYY FineReader, and it does indeed give fine output, but I don't want to use non-free software.
I wonder whether some preprocessing would solve the issue, or whether it is necessary to write and train a different OCR system.
If you look closely at your antialiased text samples, you'll notice that the edges contain a lot of red and blue:
This suggests that the antialiasing is taking place inside your computer, which has used subpixel rendering to optimise the results for your LCD monitor.
If so, it should be quite easy to extract the text at a higher resolution. For example, you can use ImageMagick to extract images from PDF files at 300 dpi by using a command line like the following:
convert -density 300 source.pdf output.png
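If you prefer to stay in Python, roughly the same thing can be done with the pdf2image package (a wrapper around poppler's pdftoppm); the file names here are placeholders:

from pdf2image import convert_from_path

# Render each page at 300 dpi and save it as a PNG for OCR
pages = convert_from_path("source.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save("output-%d.png" % i)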
You could even try loading the PDF in your favourite viewer and copying the text directly to the clipboard.
Addendum:
I tried converting your sample text back into its original pixels and applying the scaling technique mentioned in the comments. Here are the results:
Original image:
After scaling to 300% and applying a simple threshold:
After smart scaling and thresholding:
As you can see, some of the letters are still a bit malformed, but I think there's a better chance of reading this with Tesseract.
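A rough sketch of the "scale then threshold" step in OpenCV (Python); the 300% factor, the use of Otsu thresholding, and the file names are assumptions:

import cv2

img = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)

# Upscale to 300% with cubic interpolation, then binarize with Otsu's method
img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("cell_binarized.png", img)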

Image preprocessing for text recognition

What's the best set of image preprocessing operations to apply to images for text recognition in EmguCV?
I've included two sample images here.
Applying a low-pass or high-pass filter won't be suitable, since the text may be of any size. I've tried median and bilateral filters, but they don't seem to affect the image much.
The ideal result would be a binary image with all the text white, and most of the rest black. This image would then be sent to the OCR engine.
Thanks
There's no such thing as the best set. Keep in mind that digital images can be acquired by different capture devices, and each device can embed its own preprocessing (filters) and other characteristics that can drastically change the image and even add noise to it. So every case has to be treated (preprocessed) differently.
However, there are common operations that can improve detection. A very basic one is to convert the image to grayscale and apply a threshold to binarize it. Another technique I've used before is the bounding box, which lets you detect the text region. To remove noise from images, you might be interested in erode/dilate operations. I demonstrate some of these operations in this post.
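A short sketch of those operations in OpenCV (Python); the threshold value, kernel size, and the assumption of dark text on a light background are placeholders to adjust, and the two-value findContours return assumes OpenCV 4:

import cv2

img = cv2.imread("sample.png")  # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Binarize, inverting so the (dark) text becomes white foreground
_, binary = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY_INV)

# Erode then dilate (an "opening") to drop small specks of noise
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Bounding boxes around connected regions, i.e. candidate text areas
contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite("text_regions.png", img)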
Also, there are other interesting posts about OCR and OpenCV that you should take a look at:
Simple Digit Recognition OCR in OpenCV-Python
Basic OCR in OpenCV
Now, just to show you a simple approach that can be used with your sample image, this is the result of inverting the color and applying a threshold:
#include <opencv2/opencv.hpp>

int main(int argc, char** argv)
{
    cv::Mat new_img = cv::imread(argv[1]);
    cv::bitwise_not(new_img, new_img);  // invert so the text becomes bright
    double thres = 100;
    double color = 255;
    cv::threshold(new_img, new_img, thres, color, cv::THRESH_BINARY);
    cv::imwrite("inv_thres.png", new_img);
    return 0;
}
Try morphological image processing. Have a look at this. However, it works only on binary images, so you will have to binarize the image first (threshold?). Although it is simple, it is dependent on font size, so one structuring element will not work for all font sizes. If you want a generic solution, there are a number of papers on text detection in images; a search for this term on Google Scholar should provide you with some useful publications.

How to train Tesseract to recognize small numbers at low DPI?

I get the data from video, so there is no way for me to rescan the image, but I can scale it if necessary.
I only have a limited set of characters, 1234567890:, but I have no control over the DPI of the original image or the font.
I tried to train Tesseract, but without any visible effect; the test project is located at https://github.com/ssbarnea/tesseract-sample, but the current results are really bad.
Example of original image being captured:
Example of postprocessed image for OCR:
How can I improve the OCR process in this case?
You can try adding some extra space at the edges of the image; sometimes that helps Tesseract. However, open-source OCR engines are very sensitive to the source image's DPI.
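A minimal sketch of the padding tip in OpenCV (Python); the 20-pixel margin and the assumption of a dark background (value=0) are placeholders:

import cv2

img = cv2.imread("digits.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Pad the image with 20 pixels of background on every side before OCR
padded = cv2.copyMakeBorder(img, 20, 20, 20, 20,
                            cv2.BORDER_CONSTANT, value=0)  # use 255 for a white background

cv2.imwrite("digits_padded.png", padded)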
