Improving the quality of OCR using pytesseract - opencv

I'm trying to use pytesseract to recognize text from this image, but I'm unable to get satisfactory results.
I've tried a number of things to make it easier for Tesseract to recognize the text. My Tesseract version is 5.0.
Removed the color from the image, leaving only black and white
Converted it to grayscale before reading it
Applied a Gaussian blur
Enlarged the image so the text could be read more reliably
Tried an inverse threshold to make the text stand out more, but still no positive outcome.
[Attached images: original, black-and-white, and black-and-white binary versions]
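For reference, a minimal sketch of the kind of preprocessing described above (grayscale conversion, enlargement, light blur, inverse threshold) before handing the result to pytesseract; the filename, scale factor, and page-segmentation mode are placeholders rather than values from the question:

import cv2
import pytesseract

img = cv2.imread("input.png")                         # hypothetical input image

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)          # drop the color information
gray = cv2.resize(gray, None, fx=2, fy=2,             # blow the image up
                  interpolation=cv2.INTER_CUBIC)
gray = cv2.GaussianBlur(gray, (3, 3), 0)              # light Gaussian blur
_, binary = cv2.threshold(gray, 0, 255,                # inverse Otsu threshold
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

print(pytesseract.image_to_string(binary, config="--psm 6"))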

Related

Use OpenCV to detect text blocks to send to Tesseract iOS

How can I use OpenCV to detect all the text in an image? I want to be able to detect "blocks" of text individually, then pass the recognized blocks into Tesseract. Here is an example: if I were to scan this, I would want to scan the paragraphs separately, not go from left to right, which is what Tesseract does.
[Image of the example]
That would be my first test:
Threshold the image to get a black and white image, with the text in black
Erode it until each paragraph merges into a big blob. It may have lots of holes; that doesn't matter.
Find the contours and their bounding boxes
If some paragraphs merge, erode less, or dilate a little bit after the erode. A rough sketch of these steps is shown below.
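A minimal sketch of that test in Python with OpenCV; the kernel size, iteration count, and filenames are assumptions to be tuned per image, not values from the answer:

import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)     # hypothetical input image

# 1. Threshold so the text becomes white on black (the blobs are then the foreground)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# 2. Dilate the white text (the equivalent of eroding the original black-on-white
#    image) until each paragraph merges into one big blob
kernel = np.ones((5, 5), np.uint8)
blobs = cv2.dilate(binary, kernel, iterations=5)

# 3. Find the contours of the blobs and take their bounding boxes
contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)
    cv2.rectangle(img, (x, y), (x + w, y + h), 0, 2)    # mark each paragraph block

cv2.imwrite("paragraph_boxes.png", img)

Each bounding box can then be cropped out and passed to Tesseract separately.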

Sharpening image using OpenCV OCR

I've been trying to work on an image-processing/OCR script that will allow me to extract the letters (using Tesseract) from the boxes found in the image below.
Following a lot of processing, I was able to get the picture to look like this.
To remove the noise, I inverted the image and then applied flood filling and Gaussian blurring. This is what I ended up with next.
After running it through some thresholding and erosion to remove the noise (erosion being the step that distorted the text), I was able to get the image to look like this before running it through Tesseract.
This is a pretty good rendering and allows fairly accurate results through Tesseract, though it sometimes fails because it reads the hash (#) as an H or a W. This leads me to my question!
Is there a way, using OpenCV, skimage, or PIL (OpenCV preferably), to sharpen this image in order to increase my chances of Tesseract properly reading it? Or is there a way to get from the third image to the final image WITHOUT having to use erosion, which ultimately distorted the text?
Any help would be greatly appreciated!
OpenCV does have functions like filter2D that convolve an arbitrary kernel with a given image. In particular, you can use kernels that are designed for image sharpening. The main question is whether this will improve the results of your OCR library or not. The image is already pretty sharp, and the noise in it is not the result of blur. I have never worked with Tesseract myself, but I am fairly sure it already does all the noise reduction it can, and 'helping' it in this process may actually have the opposite effect. For example, any sharpening process tends to amplify noise (as opposed to noise-reduction processes, which usually blur the image). Most computer vision libraries give better results when provided with raw (unprocessed) images.
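If a concrete example helps, here is a minimal sketch of kernel sharpening with filter2D in Python; the 3x3 kernel is one common sharpening choice, not one prescribed by the answer:

import cv2
import numpy as np

img = cv2.imread("boxes.png")                        # hypothetical input image

# A common 3x3 sharpening kernel: emphasizes the center pixel against its neighbors
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=np.float32)

sharpened = cv2.filter2D(img, -1, kernel)            # -1 keeps the input bit depth
cv2.imwrite("sharpened.png", sharpened)

As the answer notes, sharpening also amplifies noise, so it may or may not help the OCR step.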
Edit (after question update):
There are multiple ways to do so. The first one that I would test is this: your first binary image is pretty clean and sharp. Instead of using morphological operations that reduce the quality of the letters, switch to filtering contours. Use the findContours function to find all contours in the image and store their hierarchy (i.e. which contour is inside which). Of all the found contours you actually need only those on the first and second levels, i.e. the outer and inner contours of each letter (contours at the zero level are the outermost contours). Other contours can be discarded. Among the contours that do belong to the first level, you can discard those whose bounding box is too small to be a real letter. After those two discarding procedures I would expect that most of the remaining contours are parts of the letters. Draw them on a white image and run OCR. (If you want white letters on a black background, you will need to invert the order of vertices in the contours.)
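A rough sketch of that contour-filtering idea in Python; the minimum letter size, the filenames, and the use of RETR_CCOMP (which returns exactly two hierarchy levels) are assumptions to adapt to the actual image:

import cv2
import numpy as np

# The binary image from the question: black letters on a white background
binary = cv2.imread("letters_binary.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file

# findContours expects white objects on black, so invert before searching;
# RETR_CCOMP stores outer contours and their holes in a two-level hierarchy
contours, hierarchy = cv2.findContours(
    255 - binary, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)

canvas = np.full_like(binary, 255)            # white output image
min_w, min_h = 5, 10                          # assumed minimum letter size in pixels

for i, cnt in enumerate(contours):
    parent = hierarchy[0][i][3]
    if parent == -1:                          # outer contour of a candidate letter
        x, y, w, h = cv2.boundingRect(cnt)
        if w >= min_w and h >= min_h:
            cv2.drawContours(canvas, contours, i, 0, cv2.FILLED)
    else:                                     # inner contour (hole): repaint it white
        x, y, w, h = cv2.boundingRect(contours[parent])
        if w >= min_w and h >= min_h:
            cv2.drawContours(canvas, contours, i, 255, cv2.FILLED)

cv2.imwrite("letters_filtered.png", canvas)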

Tesseract works for images that contain only text - Crop image to get only the text part from the image

Tesseract works for images that contain only text. But what if an image contains both text and graphics, and we want only the text to be recognized?
I am using Tesseract for OCR recognition of text from images. Tesseract gives the exact text from images that contain only text. However, when I tried an image that contains a car and its number plate, Tesseract gave garbled text for the car number. I applied grayscale conversion, thresholding, and other effects to get the exact text output and to increase the accuracy, but it still gives incorrect text mixed with oddly encoded characters. For that reason, I am looking for other ways to extract such text.
Does anyone know how to get text from such images using Tesseract OCR, or any alternative, so that only the text part remains in the image and Tesseract can output the exact text?
Cropping the image is one way to keep only the text, but how can that be done using ImageMagick or any other tool?
Thanks.
If you know exactly where on the image the text is, you can send the coordinates of those regions along with the image to Tesseract for recognition. Take a look at the Tesseract API methods TesseractRect or SetRectangle.
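From Python, the same idea can be approximated by cropping the known region before handing it to pytesseract (SetRectangle itself belongs to the C++ API and wrappers that expose it); the filename and coordinates below are placeholders:

import cv2
import pytesseract

img = cv2.imread("car.png")                   # hypothetical input photo

# Placeholder coordinates of the region that contains the text (e.g. the plate)
x, y, w, h = 100, 200, 320, 80
roi = img[y:y + h, x:x + w]

# OCR only the cropped region instead of the whole photo
text = pytesseract.image_to_string(roi, config="--psm 7")   # treat it as one line
print(text)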

Image preprocessing for text recognition

What's the best set of image preprocessing operations to apply to images for text recognition in EmguCV?
I've included two sample images here.
Applying a low or high pass filter won't be suitable, as the text may be of any size. I've tried median and bilateral filters, but they don't seem to affect the image much.
The ideal result would be a binary image with all the text white, and most of the rest black. This image would then be sent to the OCR engine.
Thanks
There is no such thing as the best set. Keep in mind that digital images can be acquired by different capture devices, and each device can embed its own preprocessing system (filters) and other characteristics that can drastically change the image and even add noise to it. So every case has to be treated (preprocessed) differently.
However, there are common operations that can be used to improve detection. For instance, a very basic one would be to convert the image to grayscale and apply a threshold to binarize it. Another technique I've used before is the bounding box, which allows you to detect the text region. To remove noise from images you might be interested in erode/dilate operations. I demonstrate some of these operations in this post.
Also, there are other interesting posts about OCR and OpenCV that you should take a look at:
Simple Digit Recognition OCR in OpenCV-Python
Basic OCR in OpenCV
Now, just to show you a simple approach that can be used with your sample image, this is the result of inverting the color and applying a threshold:
#include <opencv2/opencv.hpp>

int main(int argc, char** argv) {
    cv::Mat new_img = cv::imread(argv[1]);
    cv::bitwise_not(new_img, new_img);    // invert the image colors
    double thres = 100;
    double color = 255;
    cv::threshold(new_img, new_img, thres, color, cv::THRESH_BINARY);
    cv::imwrite("inv_thres.png", new_img);
    return 0;
}
Try morphological image processing. Have a look at this. However, it works only on binary images, so you will have to binarize the image first (threshold?). Although it is simple, it is dependent on font size, so one structuring element will not work for all font sizes. If you want a generic solution, there are a number of papers on text detection in images; a search for this term on Google Scholar should provide you with some useful publications.
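As a small illustration of the idea in Python, a sketch that binarizes first and then applies opening/closing; the structuring-element size is an assumption that has to be matched to the font size, as noted above:

import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)           # hypothetical input
# Binarize first: morphological operators only make sense on a binary image
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# The structuring element size is tied to the font size; tune it per image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)    # remove small specks
cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)  # close gaps in strokes

cv2.imwrite("cleaned.png", cleaned)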

Cleaning scanned grayscale images with ImageMagick

I have a lot of scans of text pages (black text on a white background).
My usual approach is to clean them up in GIMP with the Curves dialog, using a pretty simple curve with only four points: 0,0 - 63,0 - 224,255 - 255,255.
This makes all the greyish text pitch black, makes the text sharper, and turns most of the whitish pixels pure white.
How can I achieve the same effect in a script using ImageMagick or some other Linux tool that runs completely from the command line?
-normalize or -contrast-stretch don't work because they operate on pixel counts. I need an operator which makes the colors 0-63 (grayscale) pitch black, everything above 224 pure white, and normalizes the rest.
The Color Modifications page shows many color manipulation algorithms by ImageMagick.
In this specific case, two algorithms are interesting:
-level
-sigmoidal-contrast
-level gives you perfect black/white pixels near the ends of the curve and a linear distribution in between.
The sigmoidal option creates a smoother curve between the extremes, which works better for color photos.
To get a result similar to the one in GIMP, you can try applying one after the other (to make the text and black areas really black).
In all cases, you will want to run -normalize first (or even -contrast-stretch to merge most of the noise) to make sure no black/white levels are wasted. Without this, the darkest color could be lighter than rgb(0,0,0) and the brightest color could be below pure white.
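For instance, a command along these lines should approximate the four-point GIMP curve from the question; the percentages correspond roughly to the 63 and 224 break points on a 0-255 scale, and the filenames are placeholders:

convert input.png -normalize -level 24.7%,87.8% output.png

(With ImageMagick 7 the command is magick instead of convert.)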
[magick-users] Curves in ImageMagick
The first link in that archived message is a shell script that I think does what you're looking for.
