Image segmentation for connected characters using morphology - image-processing

I asked a similar question before: I tried using a watershed to segment the connected characters, but it did not work well. A week ago, a Google search led me to the same problem on Stack Overflow, "Segmentation for connected characters".
In the answers, the user mmgp provides a solution that uses a morphological method and a closing operation, but I do not understand all of it.
So far I have only thinned the image using hit-and-miss morphology.
The original image:
The thinned image, and an enlarged view of it:
With 4-connectivity, the digit 9 can be split off as an individual character, but the 44 is still connected.
I have some questions about "Segmentation for connected characters":
1. Why resize the original image to 200 pixels and then thin it?
Why not thin the original image directly?
2. How do I extract the branch points and apply a morphological closing to the thinned image?
I only know that closing combines dilation and erosion.
The closing needs a vertical line of 2*height+1 (is this the height of the structuring element?). I don't know how to set it, or how to construct the structuring element (3*3 or something else?).
In the end they get the following image:
I need some help. Can someone tell me how to apply the closing operation and get the image above?
Thanks.

I have solved this problem using foreground and background features.
The algorithm is described in detail in the following papers:
A genetic framework using contextual knowledge for segmentation and recognition of handwritten numeral strings
Segmentation of Handwritten Numeral Strings in Farsi and English Languages
The following images are my screenshots.
Foreground region and foreground skeleton:
Background region and background skeleton:
The skeleton image for 44:
Based on the feature points above, we can construct a segmentation path to split the digits 449.
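As a rough illustration of what "feature points" on a skeleton can mean, the sketch below finds end points and branch points of a one-pixel-wide skeleton with the crossing number (the count of 0-to-1 transitions around a pixel's 8 neighbours). This is a common heuristic and an assumption on my part; the exact feature definitions in the cited papers are not reproduced here, and the names `crossing_number` and `feature_points` are only illustrative.

```python
# 8 neighbours in circular order: N, NE, E, SE, S, SW, W, NW as (drow, dcol).
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def crossing_number(img, r, c):
    """Count 0 -> 1 transitions while walking once around the 8 neighbours."""
    rows, cols = len(img), len(img[0])

    def at(rr, cc):
        # Pixels outside the image count as background.
        return img[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0

    ring = [at(r + dr, c + dc) for dr, dc in RING]
    return sum(1 for i in range(8) if ring[i] == 0 and ring[(i + 1) % 8] == 1)


def feature_points(img):
    """Return (end_points, branch_points) as lists of (row, col)."""
    ends, branches = [], []
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            if not value:
                continue
            n = crossing_number(img, r, c)
            if n == 1:
                ends.append((r, c))
            elif n >= 3:
                branches.append((r, c))
    return ends, branches


# A small T-shaped skeleton: three arms meeting at (row 1, col 3).
skeleton = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
ends, branches = feature_points(skeleton)
```

On a real thinned digit, the same scan would be run on both the foreground skeleton and the background skeleton, and the resulting points would be candidates for building the cut path.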

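Use the following for the closing operation (h is the height mentioned in the question):
import cv2
# ksize is (width, height), so (1, 2*h+1) is a vertical line of 2*h+1 pixels.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 2*h+1))
closed_img = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)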
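To make clear what a closing with a vertical line does, here is a minimal pure-Python sketch on a tiny binary image: a dilation followed by an erosion with the same vertical line. It is only an illustration of the operation, not OpenCV's implementation; the function names are made up for this example, and pixels outside the image are simply ignored.

```python
def dilate_vertical(img, length):
    """Binary dilation with a vertical line of odd `length` centred on each pixel."""
    r = length // 2
    rows, cols = len(img), len(img[0])
    return [[int(any(img[k][x] for k in range(max(0, y - r), min(rows, y + r + 1))))
             for x in range(cols)] for y in range(rows)]


def erode_vertical(img, length):
    """Binary erosion with the same line; pixels outside the image are ignored."""
    r = length // 2
    rows, cols = len(img), len(img[0])
    return [[int(all(img[k][x] for k in range(max(0, y - r), min(rows, y + r + 1))))
             for x in range(cols)] for y in range(rows)]


def close_vertical(img, length):
    # Closing = dilation followed by erosion with the same structuring element.
    return erode_vertical(dilate_vertical(img, length), length)


# Two foreground pixels in the middle column, separated by a one-pixel gap.
img = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
closed = close_vertical(img, 3)  # the gap at row 3 is filled, nothing else changes
```

A line longer than the gap bridges it vertically, while columns that contain no foreground stay empty, which is why the orientation of the structuring element matters.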

Related

How can I improve Tesseract results quality?

I'm trying to read the NIRPP number (social security number) from a French Carte Vitale using Tesseract OCR (I'm using TesseractOCRiOS 4.0.0). Here is what I'm doing:
First, I request a picture of the whole card :
Then, using a custom cropper, I ask the user to zoom specifically on the card number:
And then I catch this image (1291x202px) and using Tesseract I try to read the number:
let tesseract = G8Tesseract(language: "eng")
tesseract?.image = pickedImage
tesseract?.recognize()
print("\(tesseract?.recognizedText ?? "")")
But I'm getting pretty bad results... only like 30% of the time Tesseract is able to find the right number, and among these sometimes I need to trim some characters (like alpha characters, dots, dashes...).
So is there a solution for me to improve these results?
Thanks for your help.
To improve your results:
Zoom your image to an appropriate level. The right amount of zoom will improve your accuracy considerably.
Configure Tesseract so that only digits are whitelisted. I am assuming here that what you are trying to extract contains only digits. If you whitelist only digits, you improve your chances of recognizing 0 as 0 and not as the letter O.
If your extracted text should match a regex, configure Tesseract to use that regex as well.
Preprocess your image to remove any background colors, and apply morphological operations such as erosion to increase the space between your characters/digits. If they are too close together, Tesseract will have a hard time recognizing them correctly. Most image processing libraries come with these operations built in.
Use TIFF as the image format.
Once you have the right preprocessing pipeline and Tesseract configuration, you will usually get very good and consistent results.
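Whatever the engine returns, the "digits only" and "matches a regex" ideas can also be applied as a post-processing step. The sketch below is written in Python for brevity (the question itself uses Swift) and is an assumption-laden illustration: the look-alike table is hypothetical, and the control key formula (97 minus the 13-digit number modulo 97) is the commonly documented rule for this number, with Corsican department codes (2A/2B) not handled.

```python
import re

# Hypothetical look-alike table: letters an OCR engine may confuse with digits.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})

# 13-digit number followed by a 2-digit control key.
NIR_PATTERN = re.compile(r"\d{15}")


def clean_digits(text):
    """Map look-alike letters to digits, then drop everything that is not a digit."""
    return re.sub(r"\D", "", text.translate(LOOKALIKES))


def nir_key(first13):
    # Control key: 97 - (number mod 97). Codes containing 2A/2B are not handled here.
    return 97 - int(first13) % 97


def is_plausible_nir(text):
    """True if the cleaned text is 15 digits and the last two match the control key."""
    digits = clean_digits(text)
    if not NIR_PATTERN.fullmatch(digits):
        return False
    return int(digits[13:]) == nir_key(digits[:13])


# Build a self-consistent example number instead of hard-coding one.
body = "1841275123456"
example = body + f"{nir_key(body):02d}"
noisy = " ".join(example).replace("0", "O")  # simulate spacing and O/0 confusion
ok = is_plausible_nir(noisy)
```

A failed check is a cheap signal to ask the user to retake or re-crop the picture rather than silently accepting a wrong number.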
There are a couple of things you need to do:
1. Convert the image to black and white or grayscale.
You can use built-in functionality such as the Graphics framework, or a third-party library such as OpenCV or GPUImage, for the black-and-white or grayscale conversion.
2. Then apply text detection using the Vision framework.
From the Vision text detection results you can crop the text according to the detected coordinates.
3. Pass these cropped images (the detected text) to TesseractOCRiOS.
I hope it works for your use case.
Thanks
I have a similar issue. I discovered that Tesseract recognizes text only if the given image contains a region of interest.
I solved the problem using Apple's Vision framework. It has VNDetectTextRectanglesRequest, which returns the CGRect of each piece of detected text in the image. You can then crop the image to the regions where text is present and send them to Tesseract for recognition.
Ray Smith says:
Since HP had independently-developed page layout analysis technology that was used in products, (and therefore not released for open-source) Tesseract never needed its own page layout analysis. Tesseract therefore assumes that its input is a binary image with optional polygonal text regions defined.

How can I tell Tesseract that my font has a particular size?

I have a collection of type-written image captions which look like this:
I know that the typewriter is consistent and monospace, with characters measuring 14x22px (as measured from the top of a capital letter to the bottom of a descender).
Tesseract is producing output like this:
The results are mostly good when Tesseract has detected the correct bounding boxes for the letters. But there are many strings of letters which are clumped together (e.g. "Ea", "tree", "fr" and "om" on the first line). These are always transcribed incorrectly and account for the majority of errors.
This is frustrating because I know a priori that all the characters are of a particular size. Is it possible to pass this knowledge on to the tesseract command line tool?
My command to generate the box file is:
tesseract foo.jpg foo batch.nochop makebox
If possible, I'd prefer to avoid training Tesseract on the font—I don't have any manually transcribed samples, so building a corpus of training data would require some effort.
I'm not sure that connected characters throw Tesseract off completely, as Noremac said.
Actually, I think it includes a step that chops joined characters whenever the result of a word detection is unsatisfactory, as explained in section 4.1 of An Overview of the Tesseract OCR Engine.
I also think that once it finds fixed-pitch text, it should automatically chop the text, even if the characters are connected (look at figure 2 of the same paper).
I know that it's a little bit late to add this answer, but maybe it will help some future visitors!
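To illustrate what fixed-pitch chopping means in principle, here is a small sketch that splits an over-wide bounding box into equal cells using a known character width (14 px in the question). It is not Tesseract's chopper and not a Tesseract option; `chop_fixed_pitch` is a hypothetical helper that could be applied to boxes before or after recognition.

```python
def chop_fixed_pitch(box, char_width):
    """Split a box (left, top, right, bottom) into equal cells of about `char_width`."""
    left, top, right, bottom = box
    # Estimate how many characters fit, never fewer than one.
    n = max(1, round((right - left) / char_width))
    cell = (right - left) / n
    return [(round(left + i * cell), top, round(left + (i + 1) * cell), bottom)
            for i in range(n)]


# A clump such as "Ea" that is 28 px wide becomes two 14 px cells.
cells = chop_fixed_pitch((100, 10, 128, 32), 14)
```

Each cell can then be recognized on its own; in a monospace font the cell boundaries are a reasonable first guess even when the glyphs touch.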
The issue isn't the font size so much as the letters connecting. If you zoom in on the images above with a program that shows the actual pixels (rather than blurring them together), you can see that the characters in each clump are actually connected. Tesseract OCR is based entirely on connected components, so if characters are connected at all, it is thrown off completely. I see a couple of options:
If possible, give it a higher-resolution image with more separation between the characters.
Adjust the preprocessing to use a stricter threshold.
I noticed that the pixel connecting the E and the a in the first occurrence is lighter, so adjusting the threshold will remove that connection. However, this could affect more than you want, such as breaking apart characters where you don't expect it.
For updating the thresholding consider this: https://groups.google.com/forum/#!topic/tesseract-ocr/JRwIz3xL45U
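The effect of a stricter threshold on connected components can be shown on a toy grayscale patch. In this sketch (pure Python, made-up pixel values) two dark strokes are joined by one lighter pixel; a loose threshold keeps the bridge and yields one component, a stricter one removes it and yields two. The thresholds 160 and 100 are arbitrary choices for the example.

```python
from collections import deque


def binarize(gray, threshold):
    """Foreground (1) where the pixel is darker than `threshold` (dark ink, light paper)."""
    return [[int(v < threshold) for v in row] for row in gray]


def count_components(binary):
    """Count 8-connected foreground components with a breadth-first flood fill."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Two dark strokes (20) joined by one lighter pixel (140) on white paper (255).
gray = [
    [255, 255, 255, 255, 255],
    [20, 20, 140, 20, 20],
    [20, 20, 255, 20, 20],
    [255, 255, 255, 255, 255],
]
loose = count_components(binarize(gray, 160))   # bridge kept: one component
strict = count_components(binarize(gray, 100))  # bridge removed: two components
```

The same trade-off mentioned above applies: lowering the threshold further would eventually start breaking thin strokes inside a single character.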

Image segmentation when characters are connected, in AForge and C#

I have the following image, which has been binarized.
I need to segment this image and recognize the digits. The two '4' digits and the '9' are connected together.
I have read some documents that mention the 'watershed' morphology method. The following image is the result of applying watershed segmentation.
Clearly the two 4 digits are still connected, but the 9 has been segmented successfully.
I need some help with how to segment the 44 characters. Thanks.
zhengchun,
you need to understand that this is a quite difficult task which, in my opinion, cannot be perfectly solved in all cases.
In the first place, correctly splitting between characters without prior knowledge of their size and shape is just impossible: consider the letter W, which could very well be split into two V's; conversely, nothing can tell you that an accidentally touching I and J are indeed two different letters rather than a U.
This means that no "blind" method like the watershed or any other can succeed, whatever the sophistication. Geometry alone is not enough, you need to rely on some description of the font (sizes and shapes).
To the best of my knowledge, you must let segmentation and recognition work together. What you can do is:
use the initial segmentation, hoping that touching and broken characters do not arise so often;
starting from the left, try immediate character recognition by splitting after one character width (you will need to try every font character in turn, possibly with different widths);
keep the most likely recognition result and continue recognition from that split, to the right;
if you expect broken characters, you can as well try recognitions that span two or more blobs and group these. (Gaps between blobs are good hints for splits, unless your characters can be broken or miss parts.)
You can improve the above procedure by adding heuristics to decide where splits are more likely, such as at a height minimum, but this is tricky. A pinch of black magic...
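The left-to-right procedure above can be sketched on a toy example. Here the "image" is a list of pixel columns, the "font" is a dictionary of glyph templates with different widths, and recognition is a plain pixel-agreement score. All of these are simplifying assumptions made for illustration; a real system would use a proper classifier and would likely keep several candidates per split instead of only the best one.

```python
def match_score(columns, template):
    """Fraction of pixels that agree between an image slice and a same-width template."""
    total = sum(len(col) for col in template)
    same = sum(a == b for col, tcol in zip(columns, template) for a, b in zip(col, tcol))
    return same / total


def recognize(columns, templates):
    """Greedy reading: at each split, try every glyph width and keep the best match."""
    pos, text = 0, []
    while pos < len(columns):
        best = None
        for char, template in templates.items():
            width = len(template)
            if pos + width > len(columns):
                continue  # this glyph does not fit in what is left
            score = match_score(columns[pos:pos + width], template)
            if best is None or score > best[0]:
                best = (score, char, width)
        if best is None:
            break  # nothing fits in the remaining columns
        text.append(best[1])
        pos += best[2]  # continue from the split implied by the winner
    return "".join(text)


# Toy font: each glyph is a list of pixel columns (top to bottom); widths differ.
templates = {
    "4": ["110", "010", "111"],
    "9": ["111", "101", "111"],
    "7": ["100", "111"],
}
# "449" with no gaps between glyphs, i.e. fully touching characters.
image = templates["4"] + templates["4"] + templates["9"]
result = recognize(image, templates)
```

Because the split position is decided by the recognizer rather than by gaps in the image, touching characters are handled the same way as separated ones, at the cost of trying every glyph at every position.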

how do I identify letters in an image? (before OCRing)

All I can find on the web is about OCR, but I'm not there yet: I still have to find where the letters are in the image.
Any help will be appreciated.
The interesting thing is that the answer is not as simple as it may seem. Some may think that locating characters in the picture is the first step of OCR, but that is not the case. Actually, you can't be sure where each character is located until you have finished recognizing.
How it works depends entirely on the type of image you are going to recognize. First you should segment your image into text areas (blocks) and everything else.
Just a few examples:
If you are recognizing a license plate in a picture of a car, you should first locate the license plate, and only then split it into separate characters.
If you are recognizing an application form, you can locate the text areas just by knowing its layout.
If you are recognizing a scan of a book page, you have to distinguish pictures from text areas and then work only on the text.
From this point on you no longer need the original image; all you need is a binarized image of the text block. All OCR algorithms work on binary images. You may also need other image transformations such as line straightening, perspective correction, skew correction and so on; again, that depends on the type of images you are recognizing.
Once the text block is found and normalized, the next step is to find the lines of text in the block. In the trivial case of horizontal lines of text this is quite simple: build a histogram of pixel counts per horizontal row.
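For the trivial horizontal case, the row histogram (projection profile) idea can be sketched as follows. The function name and the tiny test page are made up for the example; a real page would first need the skew correction mentioned above, and a small noise tolerance instead of a strict "any ink" test.

```python
def find_text_lines(binary):
    """Return (first_row, last_row) for each run of rows that contain foreground pixels."""
    profile = [sum(row) for row in binary]  # ink count per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                      # a text line begins
        elif not ink and start is not None:
            lines.append((start, y - 1))   # an empty row ends the line
            start = None
    if start is not None:                  # a line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
]
lines = find_text_lines(page)
```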
Now that you have lines, you may think it is simple and you can split them into characters, hooray! Again, that is wrong. There are phenomena such as connected characters, broken characters and even ligatures (two letters forming one single shape), or letters with parts that extend to the right above or below the next character. What you should do is create several hypotheses for splitting a line into words and individual characters, then run OCR on every variant and weight every hypothesis with a confidence level. The last step is to check the different paths in this graph using a dictionary and select the best one.
And only now, when you have actually recognized everything, can you say where the individual characters are located.
So the simple answer is: recognize your image with an OCR program, and get the coordinates of the characters from its output.
Generally speaking, you'll be looking for small contiguous areas of nearly solid color. I would suggest sampling each pixel and building an array of nearby pixels that also fall within a threshold of the original pixel's color (repeat for the neighbours of each matching pixel). Put the entire array aside as a potential character (or check it now) and move on (potentially ignoring previously collected pixels for a speedup).
Optimisations are possible if you know in advance the font-size, quality and/or color of the text. If not you'll want to be fairly generous with your thresholds of what constitutes a "contiguous area".
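A minimal sketch of this region-growing idea on a grayscale grid: start from each unvisited pixel, collect 4-connected neighbours whose value is within `tolerance` of the seed, and keep the bounding box if the region is small enough to be a character. The parameters `tolerance` and `max_area`, and the choice of 4-connectivity, are assumptions made for the example.

```python
def candidate_regions(gray, tolerance, max_area):
    """Return bounding boxes (left, top, right, bottom) of small near-uniform regions."""
    rows, cols = len(gray), len(gray[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue  # already collected by an earlier region
            seed = gray[r][c]
            seen[r][c] = True
            stack, pixels = [(r, c)], []
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols and not seen[ny][nx]
                            and abs(gray[ny][nx] - seed) <= tolerance):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(pixels) <= max_area:  # large regions are treated as background
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


gray = [
    [250, 250, 250, 250, 250, 250],
    [250, 10, 12, 250, 30, 250],
    [250, 11, 250, 250, 28, 250],
    [250, 250, 250, 250, 250, 250],
]
boxes = candidate_regions(gray, tolerance=20, max_area=6)
```

Each box is only a candidate: touching or broken characters will still produce boxes that cover more or less than one letter.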

Improve OCR accuracy from scanned documents

I'm scanning a lot of A3 documents using a standard Brother A3 Multifunction and then use FineReader Pro for OCR'ing the images.
However, I'm getting a lot of errors in the characters recognized, and lots of non-alphanumeric strange characters.
Can someone give me any tips for programmatically improving the OCR accuracy, either pre-processing on the scanned images, or post-processing on the recognized text?
Edit: Find a sample pdf. It includes some sample images from which I get the poorest results.
Do you have a sample image you can post somewhere then we can quickly tell you what is causing most of your problems. FineReader is one of the better OCR engines out there so there are definitely reasons why you are getting poor results.
It could be related to poor contrast and threshold settings, image skewing, dirty rollers in the scanner, complex and coloured backgrounds, dithered backgrounds, font sizes too small, scanning dpi being too low etc...
After seeing the attached image there are a few small issues.
There are lots of dirty specks on the background page. FineReader seems to do a reasonable job with this on your images.
There is some slight skew, but that is not causing any problems.
FineReader is getting confused by the bold, tall Arial-type font used for the column headers.
A big problem seems to be the bottom region of the pages, where the contrast is poor and the image is fuzzy. This seems to be a problem with the scanner but could be due to printing problems.
The printing is quite poor and I am guessing it is a scan from a newspaper. Most of your errors are due to scanning issues so it would be hard to programmatically improve the results.
Firstly, I would try scanning the image in grayscale using a slightly higher resolution and see if that helps. FineReader works well with grayscale images. If you have to have a B/W image then see if the scanner driver includes a setting for dynamic thresholding and turn it on.
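What a scanner driver does under the name "dynamic thresholding" is not specified here, but one common form is to compare each pixel with the mean of its neighbourhood instead of a single global value. The sketch below, with made-up pixel values, shows why that helps when contrast fades across the page: the global threshold marks plain paper at the dark end as ink, while the local one does not. `radius` and `offset` are illustrative parameters.

```python
def adaptive_threshold(gray, radius, offset):
    """Foreground (1) where a pixel is darker than its local mean minus `offset`."""
    rows, cols = len(gray), len(gray[0])
    out = []
    for y in range(rows):
        row = []
        for x in range(cols):
            # Window clipped at the image border.
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(rows, y + radius + 1))
                      for i in range(max(0, x - radius), min(cols, x + radius + 1))]
            row.append(int(gray[y][x] < sum(window) / len(window) - offset))
        out.append(row)
    return out


# One scan line: paper fades from 200 to 110, with ink at positions 2 and 7.
gray = [[200, 190, 110, 170, 160, 150, 140, 60, 120, 110]]
global_result = [[int(v < 115) for v in row] for row in gray]
local_result = adaptive_threshold(gray, radius=1, offset=15)
```

With these values the global threshold also fires on the last pixel, which is only darker paper, whereas the local version marks just the two ink pixels.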
Your images would not be an easy task for any OCR engine. You will get better results if you can improve the scanning. Page 3 has a lot of noise in the bottom right corner.
What version of FineReader are you using? FR10 would probably give better results than previous versions.

Resources