OCR text line separation and character separation - image-processing

I am creating a Sanskrit OCR.
So, I just want to know how to separate the text lines in an image, then split each line into characters, and how to separate the symbols (e.g. vowel signs and other diacritics) attached to characters?
I tried the Hough transform with a threshold of 50, but it's not very accurate.

Related

How can I convert characters drawn with draw(rect:) into strings?

I am working on a project where I need to draw alphabetic characters using the draw(rect:) method. So far I am able to draw these hand-drawn characters, but I am unable to export them as strings. I tried capturing the drawn alphabets via screenshots and OCR, but those algorithms just detect the text and resample it to predefined shapes, whereas I need to convert the drawn character as-is into a string or a ".ttf".
If your goal is to have the user draw characters and turn their drawings into a font, you have your work cut out for you.
I suggest you have the user draw one character at a time. Since you know which one they are drawing, you know what character you're getting.
You'd record the series of gestures the user makes to draw the character, not the resulting pixels.
You'd then need to convert their raw gestures into a series of bezier curves. Then you'd need to encode those bezier curves into a glyph in the font you are building.
I've never figured out how to generate Bezier curves from points along the curve, since the middle control points for a Bezier curve don't lie along the curve. I've read that it's possible however.
As to encoding the curves into a glyph in TrueType font format, I got nuttin'. I suggest googling it.

Method to separate dotted numbers from an image

I'm going to do a specialized OCR system which recognizes the above dotted numbers. (The sample picture may not contain all special cases - see below. ) We decided to separate the number string and recognize each digit before we put them altogether to form a final result.
The question is:
How to clearly separate all digits with OpenCV or other image algorithms?
Our difficulty lies in:
1. The image I uploaded is a synthesized image, produced from handpicked digits with slight morphing to simulate anomalies seen in actual use, e.g. some dots are linked into one blob, some dots are eroded, and some dots are offset. We failed to determine the digits' contours using morphology.
2. Sometimes a digit may be skewed heavily, like italics with kerning, making a "clean and complete" bounding box impossible.
Some of the ideas we thought of are:
1. Find a way to draw slanted lines to separate the digits instead of traditional vertical lines. We assume that these dotted numbers should have been straight-up monospace characters, and only shear will occur instead of rotation.
2. If there are any method better than simple morphology that could link the dots of each number together and manage to keep dots of separate digits away, it will also be useful.
EDIT: Please don't comment below the original question; just submit your answer. I appreciate any help, no matter how simple your answer may seem.
EDIT: Since the image I provided is idealized compared to real-world input, a simple morphological operation won't solve the problem. Also, I'm looking for a solution that separates the characters; linking the dots together is not the only option.

cursive character segmentation in OCR

I have built an OCR application for normal (non-cursive) handwritten characters. For character segmentation I used the histogram projection-profile method, which works successfully for normal English characters.
I have used horizontal projection for line segmentation and vertical projection for character segmentation.
To segment the lines of a cursive handwritten article I can use horizontal projection as before, but I can't use the same methodology for segmenting cursive English characters, since the characters are merged with each other and also slanted. Can anyone suggest a way to segment cursive characters?
This is a difficult problem to solve due to the variability between writers and character shapes. One option, which has achieved up to 83% accuracy, is to analyze the ligatures (the connecting strokes between characters) in the writing and draw segmentation columns on the image using those ligatures as base points. A 2013 paper in Procedia Computer Science proposed this approach and published research on this particular problem: https://ac.els-cdn.com/S1877050913001464/1-s2.0-S1877050913001464-main.pdf?_tid=5f55eac2-0077-11e8-9d79-00000aacb35f&acdnat=1516737513_c5b6e8cb8184f69b2d10f84cd4975d56
Another approach to try is called skeletal analysis which takes the word as a whole and matches its shape with other known word shapes and predicts the word based on the entire image.
Good luck!

How to detect exact, predefined shapes with hough transform, like a "W"?

Let's say I have some system that scans documents, where all documents use the same font and font size.
In these documents, there will always be the same-looking letter "W". Let's say it is always 20 px tall. How can I set up the Hough transform to recognize this letter "W" at that size in my documents?
A quick Google search yields the following information of interest:
"Generalizing the Hough Transform to Detect Arbitrary Shapes" (Ballard's paper on the generalized Hough transform)
and what looks like a lecture using that paper as its source.
Also, if it's an actual "W", would an OCR engine like Tesseract be better suited to your needs?
The Hough transform for lines finds best-fit line equations; you would need additional processing to extract just the line segments. If the character strokes are several pixels thick, then to find lines effectively you might want to reduce the thickness to one pixel first (thinning/skeletonization). There are techniques to do that, but also various algorithmic traps.
Once you have your line segments, you would still have to write an algorithm to identify characters based on the relative position and angle of the line segments. It's harder than it first appears.
A normalized cross-correlation (template matching) could work if you're certain that the image will always be in a certain rotation, the characters will always be the same size, etc. But even for scans you'll see some rotation and some variation in contrast.
All that aside, it's likely cheaper in the long run to use a commercial OCR package or reasonably good open source project. OCR is hard to implement if you're not already familiar with image processing.

Image processing / super light OCR

I have 55 000 image files (in both JPG and TIFF format) which are pictures from a book.
The structure of each page is this:
some text
--- (horizontal line) ---
a number
some text
--- (horizontal line) ---
another number
some text
There can be from zero to 4 horizontal lines on any given page.
I need to find what the number is, just below the horizontal line.
BUT, numbers strictly follow each other, starting at one on page one, so in order to find the number, I don't need to read it: I could just detect the presence of horizontal lines, which should be both easier and safer than trying to OCR the page to detect the numbers.
The algorithm would be, basically:
for each image
count horizontal lines
print image name, number of horizontal lines
next image
The question is: what would be the best image library/language to do the "count horizontal lines" part?
Probably the easiest way to detect your lines is using the Hough transform in OpenCV (which has wrappers for many languages).
The OpenCV Hough transform will detect all lines in the image and return their angles and start/stop coordinates. You should only keep the ones whose angles are close to horizontal and of adequate length.
O'Reilly's Learning OpenCV explains in detail the function's input and output (p.156).
If you have good contrast, try running connected components and analyzing the result. It is an alternative to finding lines with the Hough transform, and it covers cases where your structural elements are slightly curved, or where a line detector picks up lines you don't want.
Connected-component labeling is a very fast two-pass raster-scan algorithm and will give you a mask with all your connected elements marked with different labels and accounted for. You can discard anything too short or thin (in terms of aspect ratio). Overall this can be more general and faster, but probably a bit more involved, than running the Hough transform. The Hough transform, on the other hand, is more tolerant of contrast artifacts and even accidental gaps in lines.
OpenCV has the function findContours() that finds the components for you.
You might want to try John Resig's "OCR and Neural Nets in JavaScript".
