Segmentation/matching of Arabic characters - OpenCV

I have many pictures like this, and I want to segment each character individually using some matching technique.
So I manually segmented one letter of the word, then slid that template over the original image, taking a simple image difference at each window of the same size.
This didn't work for all the words, because the character varies within the word itself, and the sliding window sometimes takes in parts of other characters.
So what is the best technique to match or extract the characters from the original image?
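The sliding comparison described above is essentially template matching, so one option worth trying (a minimal C++ sketch using OpenCV; the variable names are placeholders and the word and letter images are assumed to already be grayscale cv::Mat objects) is to let cv::matchTemplate do the sliding and scoring, which is more tolerant of small variations than a raw image difference:

#include <opencv2/opencv.hpp>

// 'word' is the full word image, 'letter' the manually segmented character.
cv::Mat scores;
cv::matchTemplate(word, letter, scores, cv::TM_CCOEFF_NORMED);

// Best-scoring window; accept it only above a tuned threshold (e.g. 0.8),
// since Arabic letter shapes change with their position in the word.
double minVal, maxVal;
cv::Point minLoc, maxLoc;
cv::minMaxLoc(scores, &minVal, &maxVal, &minLoc, &maxLoc);
cv::Rect bestMatch(maxLoc, cv::Size(letter.cols, letter.rows));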

Related

NSAttributedString: Strikethrough text with replacement text above it

I am trying to draw an NSAttributedString (actually, a constructed NSMutableAttributedString) where the "original" text has been struck and replacement text inserted above it (I'm trying to replicate the look/feel of an Ancient Greek manuscript).
My technique is a combination of NSBaselineOffsetAttributeName with NSKernAttributeName, but it appears that using a negative value for NSKernAttributeName "wipes away" the strikethrough of the text, even if the characters don't overlap.
If I put an extra space after the "A" character (in the original text), the "A" gets the strikethrough, but the "EI" is also offset to the right. So, it appears that the offset/kerning of the "EI" text affects how much of the strikethrough actually occurs.
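For reference, a minimal sketch of the technique being described (the offset and kern values are made up; the real ones depend on the font and size):

// Strike the original reading and give it a negative kern so the replacement
// that follows is pulled back over it; raise the replacement with a baseline offset.
NSMutableAttributedString *line =
    [[NSMutableAttributedString alloc]
        initWithString:@"A"
            attributes:@{ NSStrikethroughStyleAttributeName : @(NSUnderlineStyleSingle),
                          NSKernAttributeName               : @(-10) }];

[line appendAttributedString:
    [[NSAttributedString alloc]
        initWithString:@"EI"
            attributes:@{ NSBaselineOffsetAttributeName : @8 }]];

// As described above, it is the negative kern on the struck character that
// makes its strikethrough disappear when drawn.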
Here's what I'd like to reproduce (I don't care about the angle; it's not about a picture-perfect reproduction; just the gist):
Here's what is currently happening:
This is when I add an extra space after the strikethrough:
So, the only other thing I can think of would be to render a separate NSAttributedString in the correct place, separate from the current one, but I have no idea how to calculate the location of a specific character in an NSAttributedString when it's drawn. I'm drawing to a PDF, not to an on-screen control like a UILabel. Alternatively, I could draw the strikethrough myself as a line, but that still seems to require knowing the coordinates of the text in question, which are calculated on the fly. I hope to use this method to reproduce a large sample of ancient texts, so doing it by hand just isn't a good answer here.
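One way to get those coordinates (a hedged sketch; attributed, strikeIndex, and lineOrigin stand in for whatever the real PDF drawing code uses) is Core Text: a CTLine built from the attributed string can report the typographic x-offset of any character index, and adding that to the line origin gives the point at which to draw a separate overlay string or a hand-drawn strikethrough stroke.

#import <CoreText/CoreText.h>

CTLineRef line = CTLineCreateWithAttributedString(
    (__bridge CFAttributedStringRef)attributed);
CGFloat xOffset = CTLineGetOffsetForStringIndex(line, strikeIndex, NULL);
// Draw the replacement string (or stroke the line yourself) at
// (lineOrigin.x + xOffset, lineOrigin.y + whatever raise you want).
CFRelease(line);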
Anything I'm missing, or any out-of-the-box ideas to try?

Multiple fonts on a single line of an eps file?

I'm trying to manipulate and modify a large number of eps files output by a web service. I have the code in python to interface with the server, and to edit the images to within about 99% of where I want them.
My issue deals with trying to display special characters. I've tried outputting the desired unicode character, Ψ, directly into the eps file, but this is displayed as different characters in both Preview for Mac and Adobe Illustrator. From some googling, it seems like eps doesn't accept unicode characters. Fair enough...
My current solution is to try to change the font in the eps files. This works when I want to display the Ψ as a single character. I just switch the font, and then switch back. As below:
/Symbol findfont 10.5 scalefont setfont
-9.1 36.1557 moveto (Y) show
/Helvetica-Bold findfont 10.5 scalefont setfont
Where I'm having a problem is when I want to display a title on each image. The title would include a mix of Helvetica-Light and Symbol fonts; I would want it to display as ABC1-Ψ123. I can't figure out how to switch the font of a single character within a larger block of text. Currently the eps files are written as shown:
/Helvetica-Light findfont 12 scalefont setfont
-43.093 304.224 moveto (ABC1-Y123) show
I'd appreciate any help.

Change the height of individual characters using core text

I have a font where, unfortunately, the numbers and letters are different heights. I need to display a reference code which is a mix of letters and numbers, and the uneven heights of the characters look jarring. Is it possible with Core Text (or another technology on iOS) to render certain characters with a slightly stretched height so that it looks even when numbers and letters are displayed together?
E.g. I have the string '23Rt59RQ'; I need the 2, 3, 5, and 9 to be rendered with a larger height.
AFAICT, there's nothing in the CGContext API (which is what you'd want to use for laying out sets of glyphs) which would directly, easily facilitate this.
If it's really very important to use the font you are using, you could make separate calls to CGContextShowGlyphsAtPositions for alphabetic and numeral characters, calling CGContextSetFontSize each time so that the end result ends up matching, but this is a lot of overhead for just drawing text, and will probably result in undesirable performance.
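A rough sketch of what those calls would look like (the glyph and position arrays, the CGFontRef, and the two point sizes are all assumptions that would have to be computed for the real string):

// 'ctx' is the CGContext being drawn into.
CGContextSetFont(ctx, cgFont);

// Letters at the nominal size...
CGContextSetFontSize(ctx, 17.0);
CGContextShowGlyphsAtPositions(ctx, letterGlyphs, letterPositions, letterCount);

// ...then digits slightly larger so the rendered heights end up matching.
CGContextSetFontSize(ctx, 18.5);
CGContextShowGlyphsAtPositions(ctx, digitGlyphs, digitPositions, digitCount);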
My real advice would be to pick a better font so that this isn't even an issue :)
In the end I used a regex to identify the character groups and then created an attributed string, varying the size of the font given in the NSFontAttributeName attribute according to which characters were to be displayed.
Kinda hacky but it had the desired effect.
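A sketch of that approach (the font name and the 1.1 scale factor are placeholders; the right factor depends on the particular font):

NSString *code = @"23Rt59RQ";
UIFont *base = [UIFont fontWithName:@"AvenirNext-Regular" size:17];
UIFont *tall = [base fontWithSize:base.pointSize * 1.1];

NSMutableAttributedString *styled =
    [[NSMutableAttributedString alloc] initWithString:code
                                           attributes:@{ NSFontAttributeName : base }];

// Find every run of digits and bump its font size.
NSRegularExpression *digits =
    [NSRegularExpression regularExpressionWithPattern:@"[0-9]+" options:0 error:NULL];
[digits enumerateMatchesInString:code
                         options:0
                           range:NSMakeRange(0, code.length)
                      usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
    [styled addAttribute:NSFontAttributeName value:tall range:match.range];
}];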

How to obtain plain 'globe' Unicode character

If you include Unicode characters in an NSString, a lot of them will take on the color set for that text - they're just regular glyphs for that font so they're displayed like any other character. But there are some Unicode characters that are colored, for example GLOBE WITH MERIDIANS which is a blue gradient with shadows. But I have seen this same glyph elsewhere that's a simple black outline without a shadow, for example in the iOS keyboard. I would like to use that glyph, but without the adornments, and without having to create and use an image. I wondered if a different font would render it in a different format, and while iOSFonts.com does show different styles (bolder, italics), they're all blue. Is it possible to get the simple plain version?
Surely it is possible, because that appears to be exactly what Apple has implemented with a Tip. Notice the globe is the exact same color as the text and it's included in the string along with all the other characters. Surely that's not a UIImage?
Character in different fonts:
EDIT: The solution provided in the linked question doesn't work for this character, as the variant character appears to be the exact same as the original - blue with shadows.
Unfortunately, iOS doesn't have a monochrome globe symbol you can use; the only built-in font that includes U+1F310 GLOBE WITH MERIDIANS is Apple Color Emoji.
If you really want a font that renders this character as a simple black outline, you could package a copy of Symbola (downloadable here) into your app.
Alternatively, you could make a bitmap image with the icon you want and use NSTextAttachment to put it into an attributed string. Apple is likely doing something along these lines, as many of their Tips include symbols that are definitely not Unicode characters:
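A minimal sketch of the attachment route ("globe-outline" and the surrounding strings are placeholders; the bounds just nudge the image onto the baseline):

NSTextAttachment *globe = [[NSTextAttachment alloc] init];
globe.image = [UIImage imageNamed:@"globe-outline"];
globe.bounds = CGRectMake(0, -2, 16, 16);

NSMutableAttributedString *tip =
    [[NSMutableAttributedString alloc] initWithString:@"Tap "];
[tip appendAttributedString:
    [NSAttributedString attributedStringWithAttachment:globe]];
[tip appendAttributedString:
    [[NSAttributedString alloc] initWithString:@" to switch keyboards."]];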

Scan Business Card Tesseract and Leptonica iOS

I am trying to scan a business card using Tesseract OCR. All I am doing is sending the image in with no preprocessing; here's the code I am using.
Tesseract* tesseract = [[Tesseract alloc] initWithLanguage:@"eng+ita"];
tesseract.delegate = self;
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@.-()" forKey:@"tessedit_char_whitelist"];
[tesseract setImage:[UIImage imageNamed:@"card.jpg"]]; // image to check
[tesseract recognize];
NSLog(@"Here is the text %@", [tesseract recognizedText]);
Picture of card
This is the output
As you can see, the accuracy is not 100%, which is not what I am concerned about; I figure I can fix that with some simple pre-processing. However, if you notice, it mixes up the two text blocks at the bottom, which splits up the address, and possibly other information on other cards.
How can I use Leptonica (or something else, maybe OpenCV) to group the text somehow? Possibly send regions of text on the image individually to Tesseract to scan?
I've been stuck on this problem for a while any possible solutions are welcome!
I would recommend using an algorithm called the "Run Length Smoothing Algorithm" (RLSA). This algorithm is used in a lot of document image processing systems, though not every system exposes it as part of its API.
The original paper was published in 1982 and requires payment. However, the same algorithm is cited by many other papers on document image processing, where you can easily find implementation details and improvements.
One such paper is this: http://www.sciencedirect.com/science/article/pii/S0262885609002005
The basic idea is to scan the document image row by row, recording the width of the gaps between letters.
Then, nearby text characters can be combined by filtering on the width of the gaps, and setting small gaps to the same color as the text. The result will be large connected components that represent:
Words, by closing the gaps between characters,
Text lines, by closing the gaps between words, and
Paragraphs, by scanning column by column and then closing the vertical gaps between text lines.
If you do not have access to any document image analysis libraries that expose this functionality, you can mimic the effect by:
Using morphological operations (morphological closing), and then
Performing connected-component labeling on the result.
Most image processing libraries, such as OpenCV, provide such functionality. It might be less efficient to take this approach, because you will have to re-run the algorithm with different text gap sizes to achieve the different levels of clustering, unless the user provides your application with the text gap sizes.
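For what it's worth, a minimal C++ sketch of that mimicry with OpenCV (the 15-pixel kernel width is a guess standing in for the text gap size; as noted above, different sizes give different clustering levels):

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> textBlocks(const cv::Mat &gray)
{
    // Binarize with text as white on black so the closing grows the ink.
    cv::Mat bin;
    cv::threshold(gray, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    // A wide, flat kernel closes horizontal gaps between characters; a larger
    // one would merge words into lines, and lines into paragraphs.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(15, 3));
    cv::Mat closed;
    cv::morphologyEx(bin, closed, cv::MORPH_CLOSE, kernel);

    // Each connected component is now roughly one word; keep its bounding box.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(closed, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> boxes;
    for (const auto &contour : contours)
        boxes.push_back(cv::boundingRect(contour));
    return boxes;
}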
I think you've hit on a fundamental problem with OCR - printed designs of this type use white space as meaningful delimiters, but the OCR software doesn't/can't understand that.
This is just a wild stab in the dark, but here's what I would try:
Starting in the upper left, build a box perhaps 1-5% of the size of the whole image. Send that to OCR, and see if you get what looks meaningful back. If not, expand until you get something.
Once you have something, start expanding the block in reasonable units until you stop getting new data back. You can, hopefully, decide this point is "meaningful white space", and now you can consider this processed text as "one block" and thus complete. Now start with whatever the next unprocessed part of the image is, and thus work your way through until you've got the whole image complete.
By working with a set of interlinking expanding boxes, the hope is you'll only get meaningful blocks of data grouped together. Working with your example, once you isolate the logo and process it (and the resulting gibberish), the next box will start with, say, the "N" in Noah. Then you expand out to the right until you've gotten the whole name.
With this done you go again and, hopefully, you'll get a bounding box that includes the "A" in Associate, and get that whole line.
Expanding a pixel at a time would take too long with all those runs through the OCR, I'm sure, but there will surely be a trade-off between the size of the chunks you expand per step and the amount of processing required.
I don't see why this approach wouldn't work for relatively normal print designs, like a regular style business card.
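A loose sketch of that loop against the Tesseract C++ API used further down ('api' is assumed to be a configured tesseract::TessBaseAPI with the image already set; the starting width, step size, fixed row height, and imageWidth bound are all guesses):

#include <string>

// Grow a top-left box in fixed steps until the recognized text stops changing,
// then treat the result as one "block" and move on to the next region.
int width = 50;
std::string previous, current;
do {
    previous = current;
    api->SetRectangle(0, 0, width, 60);   // left, top, width, height in pixels
    char *text = api->GetUTF8Text();
    current = text ? text : "";
    delete [] text;
    width += 25;                          // expand in chunks, not a pixel at a time
} while (current != previous && width < imageWidth);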
You can try GetHOCRText, which returns all the scanned words, along with the frame of each word in that image, as XML.
char *boxtext = _tesseract->GetHOCRText(0);
You can parse that XML to get each word and its frame.
Alternatively, if you need to, you can specify the frame in the image that Tesseract should scan.
_tesseract->SetRectangle(100, 100, 200, 200);
Set this frame before you call recognize, so Tesseract will scan only that frame and return the text found there.
There is a sample iOS application on Github that does this which might be helpful for you:
https://github.com/danauclair/CardScan
How does he read the business card? He writes the following (or you can read it in the file https://github.com/danauclair/CardScan/blob/master/Classes/CardParser.m ):
// A class used to parse a bunch of OCR text into the fields of an ABRecordRef which can be added to the
// iPhone address book contacts. This class copies and opens a small SQLite database with a table of ~5500
// common American first names which it uses to help decipher which text on the business card is the name.
//
// The class tokenizes the text by splitting it up by newlines and also by a simple " . " regex pattern.
// This is because many business cards put multiple "tokens" of information on a single line separated by
// spaces and some kind of character such as |, -, /, or a dot.
//
// Once the OCR text is fully tokenized it tries to identify the name (via SQLite table), job title (uses
// a set of common job title words), email, website, phone, address (all using regex patterns). The company
// or organization name is assumed to be the first token/line of the text unless that is the name.
//
// This is obviously a far from perfect parsing scheme for business card text, but it seems to work decently
// on a number of cards that were tested. I'm sure a lot of improvements can be made here.
