iOS: How to get all word coordinates in a PDF page

I have looked through many tutorials, and usually Stack Overflow users throw out links to PDFKitten, but having tested it I was not satisfied with the result: among other things, its search does not work with multiple words.
So what I am looking for is a way to get all the words from a PDF page and highlight each word that crosses some rectangle.

I used PDFKitten for the same. What I did was:
1. While scanning the PDF, identify the words separated by spaces.
2. When a word is encountered, save that word in a model along with its current RenderingState (a model in the PDFKitten code), which will be the initial state.
3. When the complete word is found (space separated), again save the current RenderingState as the final state.
4. The code for converting a RenderingState into an actual view frame using the above initial and final states is present in PDFKitten; you can refer to that code.
5. Apply the current media box transform to the frame.
6. Finally, don't forget to convert the resulting frame into the user's coordinate system, otherwise you will observe the reverse (flipped) effect. A sketch of these last two steps follows below.
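For illustration, a minimal sketch of steps 5 and 6 above, assuming the word's rectangle in PDF user space has already been built from the initial and final rendering states (the function below is a hypothetical helper, not part of PDFKitten):

import CoreGraphics

// Hypothetical helper: map a word rectangle from PDF user space into view coordinates.
func viewRect(for wordRectInPDFSpace: CGRect,
              page: CGPDFPage,
              viewSize: CGSize) -> CGRect {
    // Transform that fits the page's media box into a rectangle of `viewSize`.
    let transform = page.getDrawingTransform(.mediaBox,
                                             rect: CGRect(origin: .zero, size: viewSize),
                                             rotate: 0,
                                             preserveAspectRatio: true)
    var rect = wordRectInPDFSpace.applying(transform)
    // PDF coordinates use a bottom-left origin; UIKit views use a top-left origin, so flip.
    rect.origin.y = viewSize.height - rect.maxY
    return rect
}

Highlighting then reduces to an intersection test: if the returned rectangle intersects your target rectangle, highlight the word.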

Related

iOS Vision: VNRecognizedText boundingBox(for:) method returning identical bounding box for any range

I'm using the iOS Vision framework to perform OCR via a VNRecognizeTextRequest call, and I'm trying to locate each individual character in the resulting VNRecognizedText observations. However, when I call the boundingBox(for range: Range<String.Index>) method on any VNRecognizedText object and for any valid range within the recognized text, I get the same bounding box back. This bounding box corresponds to the bounding box of the entire string.
Am I misunderstanding the boundingBox(for:) method, or is there some other way to get discrete location info for single characters within a recognized text observation?
Thanks in advance!
Edit:
After looking into this more, I've realized that there's some sort of link with word groups and whitespace.
Consider a recognized text observation with a string value of "Foo bar". Calling boundingBox(for:) for each character in "Foo" returns the exact same bounding box which, based on the dimensions, seems to correspond to the entire substring "Foo" instead of the single character whose range we pass into the boundingBox method. Then, in another bit of strange behavior, the boundingBox for the whitespace character is simply an empty region at the origin whose edges don't correspond with the substrings on either side of it. Finally, the behavior for the second substring is the same as the first: each character in "bar" has the same bounding box.
After hours of further investigation, I decided to get in touch with Apple Developer Tech Support. Sure enough, this is a bug! When VNRecognizeTextRequest.recognitionLevel is set to .accurate, as I had, the bug manifests. When recognitionLevel is set to .fast, the results behave as expected, with discrete bounding boxes per character.
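For reference, a minimal sketch of the workaround in Swift, iterating characters with boundingBox(for:) and using the .fast recognition level noted above (the image parameter is a placeholder):

import CoreGraphics
import Vision

// Print a normalized bounding box for each recognized character.
func characterBoxes(in cgImage: CGImage) throws {
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        for observation in observations {
            guard let candidate = observation.topCandidates(1).first else { continue }
            let text = candidate.string
            var index = text.startIndex
            while index < text.endIndex {
                let next = text.index(after: index)
                // Normalized (0...1, bottom-left origin) rect for this single character.
                if let box = (try? candidate.boundingBox(for: index..<next)) ?? nil {
                    print(text[index], box.boundingBox)
                }
                index = next
            }
        }
    }
    request.recognitionLevel = .fast   // .accurate exhibits the identical-box bug described above
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
}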

Highcharts line color based on other data than value

I have a set of data consisting of 3 values: time, dataValue, dataQuality. I'm trying to get the line color (not the marker color, because that is already working) set based on the dataQuality. The current graph is rendered based on time (x-axis) and dataValue (y-axis); dataQuality is currently not used or shown.
What I would like to have is the line color based on the data quality instead of the data value, while still rendering the data value for the line.
Is this possible to achieve within Highcharts?
Edit: To clarify with an image, as requested. The red line is the only line I'd like to render; the grey one is here solely to show the quality value for the data. So what I'd like is the red line to go grey (or whatever color we decide on) as soon as the data from the other line drops below a threshold. The markers, even though drawn with Paint here, are already handled by setting the color value for individual markers programmatically when processing the incoming data. The grey line should not be rendered at all.
Alright, having a night of sleep over it helped. Instead of trying to do it the hard way, using invisible values, I thought of another way to solve this.
I do know the timestamps for the low-quality points, so I started using those timestamps on the x-axis to create zones with different colors. Which color applies to which zone is determined by the data quality.

Scan Business Card Tesseract and Leptonica iOS

I am trying to scan a business card using Tesseract OCR. All I am doing is sending the image in with no preprocessing; here's the code I am using.
Tesseract* tesseract = [[Tesseract alloc] initWithLanguage:@"eng+ita"];
tesseract.delegate = self;
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@.-()" forKey:@"tessedit_char_whitelist"];
[tesseract setImage:[UIImage imageNamed:@"card.jpg"]]; //image to check
[tesseract recognize];
NSLog(@"Here is the text %@", [tesseract recognizedText]);
Picture of card
This is the output
As you can see, the accuracy is not 100%, which is not what I am concerned about; I figure I can fix that with some simple pre-processing. However, if you notice, it mixes up the two text blocks at the bottom, which splits up the address, and could do the same to other information on other cards.
How can I use Leptonica (or something else, maybe OpenCV) to group the text somehow? Perhaps send regions of text on the image individually to Tesseract to scan?
I've been stuck on this problem for a while; any possible solutions are welcome!
I would recommend using an algorithm called "Run Length Smoothing Algorithm" (RLSA). This algorithm is used in a lot of document image processing systems, though not every system exposes it as part of its API.
The original paper was published in 1982 and requires payment. However, the same algorithm is cited by many other papers on document image processing, where you can easily find implementation details and improvements.
One such paper is this: http://www.sciencedirect.com/science/article/pii/S0262885609002005
The basic idea is to scan the document image row by row, recording the width of the gaps between letters.
Then, nearby text characters can be combined by filtering on the width of the gaps and setting small gaps to the same color as the text. The result will be large connected components that represent:
- words, by closing the gaps between characters,
- text lines, by closing the gaps between words, and
- paragraphs, by scanning column by column and then closing the vertical gaps between text lines.
If you do not have access to any document image analysis libraries that expose this functionality, you can mimic the effect by:
- using morphological operations (morphological closing), and then
- performing connected-component labeling on the result.
Most image processing libraries, such as OpenCV, provide such functionality. It might be less efficient to take this approach, because you will have to re-run the algorithm using different text gap sizes to achieve the different levels of clustering, unless the user provides your application with the text gap sizes.
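For illustration, here is a minimal sketch (not taken from the paper) of the horizontal smoothing pass itself, operating on a binarized bitmap where true marks a text pixel; the threshold is the widest gap, in pixels, that gets closed:

// Horizontal run-length smoothing on a binarized bitmap (true = text pixel).
func runLengthSmoothHorizontal(_ bitmap: [[Bool]], threshold: Int) -> [[Bool]] {
    var result = bitmap
    for (row, pixels) in bitmap.enumerated() {
        var lastTextColumn = -1
        for (col, isText) in pixels.enumerated() where isText {
            // Close the gap to the previous text pixel if it is no wider than the threshold.
            if lastTextColumn >= 0 && col - lastTextColumn - 1 <= threshold {
                for gap in (lastTextColumn + 1)..<col {
                    result[row][gap] = true
                }
            }
            lastTextColumn = col
        }
    }
    return result
}

Run it with a small threshold to merge characters into words and a larger one to merge words into lines, then apply the same pass column-wise with a vertical threshold before connected-component labeling.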
I think you've hit on a fundamental problem with OCR - printed designs of this type use white space as meaningful delimiters, but the OCR software doesn't/can't understand that.
This is just a wild stab in the dark, but here's what I would try:
Starting in the upper left, build a box perhaps 1-5% of the size of the whole image. Send that to OCR, and see if you get what looks meaningful back. If not, expand until you get something.
Once you have something, start expanding the block in reasonable units until you stop getting new data back. You can, hopefully, decide this point is "meaningful white space", and now you can consider this processed text as "one block" and thus complete. Now start with whatever the next unprocessed part of the image is, and thus work your way through until you've got the whole image complete.
By working with a set of interlinking expanding boxes, the hope is you'll only get meaningful blocks of data grouped together. Working with your example, once you isolate the logo and process it (and the resulting gibberish), the next box will start with, say, the "N" in Noah. Then you expand out to the right until you've gotten the whole name.
With this done you go again and, hopefully, you'll get a bounding box that includes the "A" in Associate, and get that whole line.
A pixel at a time this would take too long with all those runs to the OCR, I'm sure, but there will surely be a trade-off in "size of chunks to expand per interval" and "amount of processing required".
I don't see why this approach wouldn't work for relatively normal print designs, like a regular style business card.
You can try hOCR output, which returns all the scanned words along with the frame of each word in that image as XML.
char *boxtext = _tesseract->GetHOCRText(0);
You can parse that XML to get each word and its frame; a sketch of that parsing step follows below.
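A minimal sketch, assuming hocr holds the string returned by GetHOCRText and relying on a simplified regex over Tesseract's ocrx_word spans (a real parser would walk the XML properly):

import CoreGraphics
import Foundation

// Hypothetical helper: pull word strings and their pixel frames out of hOCR markup.
func wordBoxes(fromHOCR hocr: String) -> [(word: String, box: CGRect)] {
    let pattern = #"class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]+)<"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let range = NSRange(hocr.startIndex..., in: hocr)
    return regex.matches(in: hocr, options: [], range: range).compactMap { match in
        func group(_ i: Int) -> String {
            guard let r = Range(match.range(at: i), in: hocr) else { return "" }
            return String(hocr[r])
        }
        guard let x0 = Double(group(1)), let y0 = Double(group(2)),
              let x1 = Double(group(3)), let y1 = Double(group(4)) else { return nil }
        // hOCR bbox values are image pixel coordinates with a top-left origin.
        return (group(5), CGRect(x: x0, y: y0, width: x1 - x0, height: y1 - y0))
    }
}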
Alternatively, if you need to, you can specify the rectangle of the image that Tesseract should scan:
_tesseract->SetRectangle(100, 100, 200, 200);
Set this rectangle before you call recognize; Tesseract will then scan only that region and return the text found in it.
There is a sample iOS application on Github that does this which might be helpful for you:
https://github.com/danauclair/CardScan
How does he read the business card? He writes the following (or you can read it in the file: https://github.com/danauclair/CardScan/blob/master/Classes/CardParser.m )
// A class used to parse a bunch of OCR text into the fields of an ABRecordRef which can be added to the
// iPhone address book contacts. This class copies and opens a small SQLite database with a table of ~5500
// common American first names which it uses to help decipher which text on the business card is the name.
//
// The class tokenizes the text by splitting it up by newlines and also by a simple " . " regex pattern.
// This is because many business cards put multiple "tokens" of information on a single line separated by
// spaces and some kind of character such as |, -, /, or a dot.
//
// Once the OCR text is fully tokenized it tries to identify the name (via SQLite table), job title (uses
// a set of common job title words), email, website, phone, address (all using regex patterns). The company
// or organization name is assumed to be the first token/line of the text unless that is the name.
//
// This is obviously a far from perfect parsing scheme for business card text, but it seems to work decently
// on a number of cards that were tested. I'm sure a lot of improvements can be made here.
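For illustration, a minimal sketch of the tokenizing step that comment describes (the separator pattern here is a simplification, not CardScan's actual regex):

import Foundation

// Hypothetical helper: split OCR text into tokens on newlines and on separators
// such as " | ", " - ", " / ", or " . ".
func tokens(fromOCRText text: String) -> [String] {
    guard let separator = try? NSRegularExpression(pattern: #"\s+[|\-/.·]\s+"#) else { return [] }
    let normalized = separator.stringByReplacingMatches(
        in: text,
        options: [],
        range: NSRange(text.startIndex..., in: text),
        withTemplate: "\n")
    return normalized
        .components(separatedBy: .newlines)
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }
}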

Finding coordinates of an image in a pdf to replace it with another one

I have a PDF which I would like to use as a template to create a new PDF. The goal is to place an image inside a particular placeholder rectangle in the original PDF. The creation of the original PDF is under my control, but the placeholder rectangle/bounds might be anywhere in it. I am thinking of using a dummy image (of the same dimensions) as the placeholder rectangle in the original PDF.
The Prawn gem supports placing an image at a given absolute/relative position within a page.
The trouble is that since the rectangle or dummy image could be anywhere in the original PDF, I don't know what values to use for x and y in the Prawn call:
pdf.image "/path/to/image", :at => [x,y]
Is there a way to get the coordinates of an image in the original PDF? My primitive understanding tells me that one would have to render the entire PDF to know this. Is that right? If yes, what would be a good way to render a PDF in memory (headless) and get the coordinates of various PDF objects (like bounding rectangles, images, etc.)?
I am not limited by language/runtime here as long as I can trigger it programmatically.
What could be other approaches to this problem ?
Not an answer (e.g. I don't know the Ruby language), but in lieu of any others, and because I can't post a comment yet, here's what I think.
If the conditions stated above are true (placeholder and replacement images are exactly the same size and use the same color model, e.g. 24-bit RGB) and you control template creation (so you can store the placeholder inside the PDF uncompressed), it can be as quick and dirty as a raw replacement in the file treated as a byte string. E.g. fill the placeholder with red, then search for the pattern (0xFF0000) x W*H and replace it with the raw image data, which, of course, you can get any way you like, e.g.:
convert my_image.jpg RGB:- | ...
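A minimal sketch of that raw replacement, assuming the placeholder really is stored uncompressed as W*H pixels of solid red, 24-bit RGB, and the replacement data has exactly the same length:

import Foundation

// Hypothetical helper: swap the uncompressed red placeholder for the real image bytes.
func replacePlaceholder(in pdfData: Data, width: Int, height: Int, with imageRGB: Data) -> Data? {
    // The placeholder's raw sample data is 0xFF 0x00 0x00 repeated width*height times.
    var placeholder = Data(capacity: width * height * 3)
    for _ in 0..<(width * height) { placeholder.append(contentsOf: [0xFF, 0x00, 0x00]) }
    guard imageRGB.count == placeholder.count,
          let range = pdfData.range(of: placeholder) else { return nil }
    var result = pdfData
    result.replaceSubrange(range, with: imageRGB)
    return result
}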
If this solution is too dirty or the conditions don't hold exactly, then parse the page content stream for a construct like
width 0 0 height x y cm
/name Do
It's not the cleanest approach either, but for a vast number of simple page descriptions, x and y are the coordinates you are looking for.
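A minimal sketch of that search, assuming the page's content stream has already been decoded to text and the image is placed with exactly the simple construct above (this is not a general PDF parser):

import CoreGraphics
import Foundation

// Hypothetical helper: find the translation part of the "cm" matrix that precedes "/Name Do".
func placeholderOrigin(in contentStream: String, imageName: String) -> CGPoint? {
    let number = #"([-+]?[0-9.]+)"#
    let pattern = [number, number, number, number, number, number].joined(separator: #"\s+"#)
        + #"\s+cm\s+/"# + NSRegularExpression.escapedPattern(for: imageName) + #"\s+Do"#
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: contentStream, options: [],
                                       range: NSRange(contentStream.startIndex..., in: contentStream)),
          let xRange = Range(match.range(at: 5), in: contentStream),
          let yRange = Range(match.range(at: 6), in: contentStream),
          let x = Double(String(contentStream[xRange])),
          let y = Double(String(contentStream[yRange])) else { return nil }
    return CGPoint(x: x, y: y)   // PDF user-space coordinates, bottom-left origin
}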
Further, if you control template creation, why not store additional information inside the PDF, e.g. as custom keys in the Info dictionary, and then read them back when using the template?

Word Openxml: how to get a text box the right size?

I'm using PHP to generate docx documents from a database. The generated document contains column charts which have labels attached (i.e. user shapes containing textboxes). In an attempt to get the textboxes to accommodate and display all of the text (i.e. it shouldn't be necessary for the user to resize a textbox to see all the text) my code calculates how many characters will fit into 3cm, adds linefeeds to the string as required and tells me how many lines of text are needed. I have:
<a:xfrm xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:off x="1638276" y="1676399"/>
<a:ext cx="1257325" cy="'.(252000 * $labelLeftLines).'"/>
</a:xfrm>
which I believe should give me a text box around 3.5cm wide (the extra .5cm is for the internal padding) and a height of .7cm multiplied by whatever the value of $labelLeftLines is. However, the text box always turns up as 3.cm wide by .86cm high, which only ever displays one line of text.
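For reference, the EMU arithmetic behind those figures (1 cm = 360,000 EMU):

let emuPerCm = 360_000.0
let widthCm = 1_257_325 / emuPerCm       // ≈ 3.49 cm, i.e. the intended ~3.5 cm width
let heightPerLineCm = 252_000 / emuPerCm // = 0.70 cm per line of label text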
If I add in 'autofit':
<a:bodyPr xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" vertOverflow="clip" wrap="square" rtlCol="0">
<a:spAutoFit/>
</a:bodyPr>
the generated file looks just the same. However, when I right-click on the textbox to inspect the properties, 'autofit' is indeed applied; I have to uncheck it and recheck it to make it affect the textbox.
Any openXML gurus out there?
Hmm, some random floundering around revealed that the values I need to manipulate are here:
<cdr:relSizeAnchor xmlns:cdr="http://schemas.openxmlformats.org/drawingml/2006/chartDrawing">
<cdr:from>
<cdr:x>0.47</cdr:x>
<cdr:y>0.75</cdr:y>
</cdr:from>
<cdr:to>
<cdr:x>0.67</cdr:x>
<cdr:y>1</cdr:y>
</cdr:to>
Changing those values does actually change the size of the textbox, though I haven't a clue what units are being used. Going from 0.75 to 1 produces a height of 1.43cm.
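A guess at the units, consistent with that measurement: if the from/to values are fractions of the chart drawing area, a span of 0.25 is a quarter of the chart's height (the chart height below is a hypothetical figure chosen to match):

let chartHeightCm = 5.72                        // hypothetical overall chart height
let boxHeightCm = (1.0 - 0.75) * chartHeightCm  // ≈ 1.43 cm, matching the measurement above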
One day I'll maybe be able to find my way around the documentation.
