pytesseract OCR - Extract the font size from the images

I have extracted text from images using pytesseract OCR (a Python wrapper for Tesseract). Now I want to find the approximate font size used in the input image. I searched the internet and found that there was previously a function, "WordFontAttributes", but it is no longer available. Is there any alternative?

Related

Improving the quality of OCR using pytesseract

I'm trying to use pytesseract to recognize text from this image, but I'm unable to get satisfactory results.
I've tried a number of things to make it easier for Tesseract to recognize the text. My Tesseract version is 5.0.
Took the colour out of the image, leaving only black and white
Converted it to grayscale before reading it
Applied a Gaussian blur
Scaled the image up so it could be read more reliably
Tried an inverse threshold to make the characters stand out more, but still no positive outcome.
[Images: original, black-and-white, and black-and-white binary versions]

OCR on antialiased text

I have to OCR a table from a PDF document. I wrote a simple Python + OpenCV script to extract the individual cells, but then a new problem arose: the text is antialiased and of poor quality.
Tesseract's recognition rate is very low. I've tried preprocessing the images with adaptive thresholding, but the results weren't much better.
I've tried the trial version of ABBYY FineReader, and it does give fine output, but I don't want to use non-free software.
I wonder whether some preprocessing would solve the issue, or whether it is necessary to write and train another OCR system.
If you look closely at your antialiased text samples, you'll notice that the edges contain a lot of red and blue:
This suggests that the antialiasing is taking place inside your computer, which has used subpixel rendering to optimise the results for your LCD monitor.
If so, it should be quite easy to extract the text at a higher resolution. For example, you can use ImageMagick to extract images from PDF files at 300 dpi by using a command line like the following:
convert -density 300 source.pdf output.png
You could even try loading the PDF in your favourite viewer and copying the text directly to the clipboard.
Addendum:
I tried converting your sample text back into its original pixels and applying the scaling technique mentioned in the comments. Here are the results:
Original image:
After scaling 300% and applying simple threshold:
After smart scaling and thresholding:
As you can see, some of the letters are still a bit malformed, but I think there's a better chance of reading this with Tesseract.
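The scale-then-threshold step can be sketched with Pillow; the 3x LANCZOS resampling and the cutoff of 160 are assumptions, not the exact values used for the images above.

```python
from PIL import Image

def upscale_and_threshold(img, scale=3, cutoff=160):
    gray = img.convert("L")
    # Smooth interpolation while enlarging, before binarising.
    big = gray.resize((gray.width * scale, gray.height * scale),
                      Image.LANCZOS)
    # Pixels below the cutoff become black, the rest white.
    return big.point(lambda p: 0 if p < cutoff else 255)
```

Smarter scalers (e.g. edge-preserving ones) can do better on letterforms, but even this simple version usually gives Tesseract cleaner glyph edges than the raw antialiased pixels.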

Difficult text recognition using tesseract

Some text images are not recognized by Tesseract.
For example, consider the following Rails image, which is not recognized by Tesseract.
When OCRed, the above image gives no output.
For some other images, the accuracy is not up to the mark.
I am using Ruby on Rails, and to implement Tesseract OCR text recognition I am using the 'tesseract' gem with some glue code.
What's the problem, and how do I get output with good accuracy?
The problem is that Tesseract is meant for images with only text. Results for images like the one you have posted are not guaranteed.
You will need to do some image processing (crop the image to only the text part), and convert the image to black-text-on-white-background.
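A sketch of that crop-and-normalise step in Python with Pillow, assuming you already know the text box coordinates (left, top, right, bottom); the 128 cutoffs are assumptions:

```python
from PIL import Image, ImageOps

def crop_to_black_on_white(img, box):
    region = img.convert("L").crop(box)
    # If the region is mostly dark, the text is probably light-on-dark:
    # invert so Tesseract sees black text on a white background.
    mean = sum(region.getdata()) / (region.width * region.height)
    if mean < 128:
        region = ImageOps.invert(region)
    # Binarise: dark pixels to black, the rest to white.
    return region.point(lambda p: 0 if p < 128 else 255)
```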

Tesseract works only for images that contain nothing but text - crop the image to get only the text part

Tesseract works for images that contain only text. But what if there is both text and imagery, and we want only the text to be recognized?
I am using Tesseract for OCR recognition of text from images. Tesseract extracts exact text from images that contain only text. However, when I tried an image showing a car and its number plate, Tesseract gave garbled text for the car number. I applied grayscale conversion, thresholding, and other effects to improve the accuracy of the output, but it still gives garbled text mixed with different encodings. So I am looking for other ways to extract such text.
Does anyone know how to get text from such images using Tesseract OCR, or some alternative, so that only the text part remains in the image and Tesseract can output the exact text?
Cropping the image is one way to isolate the text, but how can that be done with ImageMagick or any other tool?
Thanks.
If you know exactly where on the image the text is, you can send the coordinates of those regions along with the image to Tesseract for recognition. Take a look at the Tesseract API methods TesseractRect and SetRectangle.

How to train tesseract to recognize small numbers in low DPI?

I get the data from video, so there is no way for me to rescan the image, but I can scale it if necessary.
I do have only a limited number of characters, 1234567890:, but I have no control over the dpi of the original image or the font.
I tried to train tesseract but without any visible effect, the test project is located at https://github.com/ssbarnea/tesseract-sample but the current results are really bad.
Example of original image being captured:
Example of postprocessed image for OCR:
How can I improve the OCR process in this case?
You can try adding some extra space at the edges of the image; that sometimes helps Tesseract. However, open-source OCR engines are very sensitive to the source image's DPI.
