iOS PDF to plain text parser

I'm quite at a loss on this subject. I've read pretty much every post about it here on SO, and I would very much appreciate it if somebody would nudge me in the right direction.
I have a PDF and I would like to extract its text; I'm only interested in words and spaces. I have set up a CGPDFScanner and its callback methods. From what I have read, I only need to consider four operators as far as extracting text goes: TJ, Tj, quote (') and double quote (").
I guess I also need to keep track of the text space to determine whether letters should be joined to form a word or separated by a space, but I have no idea how to go about this.
In the PDF, all text is in the format
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
but I have not been able to figure out (using the PDF specification) what these numbers mean. Somebody on SO said that you should not be scared of the PDF specs, but frankly I do not find them very easy to read or understand.
I have studied the PDFKitten code which was helpful.
Any help would be greatly appreciated.

I cannot give you advice on how to extract words from a PDF, but the format of
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
is explained, for example, in the PDF 1.7 specification, section 9.4.3 "Text-Showing Operators". The description of the TJ operator is:
Show one or more text strings, allowing individual glyph positioning.
Each element of array shall be either a string or a number. If the
element is a string, this operator shall show the string. If it is a
number, the operator shall adjust the text position by that amount;
that is, it shall translate the text matrix, Tm. The number shall be
expressed in thousandths of a unit of text space.
So the numbers are adjustments to the distance between the letters. Since each number is subtracted from the current coordinate, a negative value such as -24.2524 moves the next glyph to the right by that many thousandths of a unit of text space (scaled by the font size), i.e. it widens the gap between glyphs.
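As a minimal, language-agnostic sketch (shown in Python) of how you might turn those adjustments into word breaks, assuming you have already decoded a TJ array into strings and numbers in your scanner callback; the 0.2 threshold is a made-up heuristic you would have to tune:

def assemble_tj(elements, font_size, space_threshold=0.2):
    """Join the strings of a TJ array, inserting a space wherever the
    positional adjustment leaves a visible gap between glyphs."""
    parts = []
    for el in elements:
        if isinstance(el, str):
            parts.append(el)
        else:
            # The number is subtracted from the text position, so a
            # negative value moves the next glyph to the right.
            gap = -el / 1000.0 * font_size  # gap in text-space units
            if gap > space_threshold * font_size:
                parts.append(" ")
    return "".join(parts)

print(assemble_tj(["X", -24.2524, "X"], 12.0))        # tiny gap -> "XX"
print(assemble_tj(["Hello,", -300, "world"], 12.0))   # wide gap -> "Hello, world"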

Related

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying the content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character locations.
What I've tried
The solution I currently have is to convert the PDF to SVG (via pdf2svg), and then parse the resulting SVG to extract single-character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word, and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case the mapping fails, because once I translate to SVG I have no information on which character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround: identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces. But that cannot always work, because, for example, "ffi" is sometimes ligated into a single glyph, sometimes into two glyphs "ff" + "i", and sometimes into two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters and then group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but that is a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
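To illustrate that dependence on font metrics with a toy sketch (the widths below are made up for the example; real values come from the font's width tables), each glyph's x position within a run follows from the previous one plus that glyph's width, expressed in thousandths of text space and scaled by the font size:

WIDTHS = {"H": 722, "e": 444, "l": 278, "o": 500}  # hypothetical metrics

def glyph_positions(text, x0, font_size):
    positions, x = [], x0
    for ch in text:
        positions.append((ch, round(x, 2)))
        x += WIDTHS[ch] / 1000.0 * font_size  # advance by the glyph width
    return positions

print(glyph_positions("Hello", 72.0, 12.0))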
The Python package pdfminer has a script, pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information.", but my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
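For example, a minimal sketch of pulling per-character boxes out of that output (the font, bbox and size attribute names match the pdfminer.six XML output I have seen, but verify them against your version):

# First run:  pdf2txt.py -t xml input.pdf > chars.xml
import xml.etree.ElementTree as ET

def char_boxes(xml_path):
    chars = []
    for text in ET.parse(xml_path).getroot().iter("text"):
        bbox = text.get("bbox")  # whitespace-only elements have no bbox
        if text.text and bbox:
            x0, y0, x1, y1 = (float(v) for v in bbox.split(","))
            chars.append((text.text, (x0, y0, x1, y1), text.get("font")))
    return chars

for ch, box, font in char_boxes("chars.xml")[:10]:
    print(repr(ch), box, font)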

Eggplant: How to read text with special characters like ' _ etc.

I am trying to read text in a given rectangle using the readText() function.
The function works correctly except when it has to read text containing special characters like ', _, &, etc.
I tried using validCharacters with readText() function. But it didn't help.
Code -
put ReadText((287,125,810,164),validCharacters:"_-'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789") into Login
I tried working with character collections, but that doesn't seem right, because the text I'm trying to pick up is dynamic: a combination of numbers, letters, and a special character. One cannot create a character-collection library covering every letter (a-z, A-Z), number (0-9), and special character.
Example of the text I'm trying to read:
Login_Userid1_1, Login'Userid1_1
So how do I read such text correctly?
Debugging OCR is a bit of an imprecise science. EggPlant has a lot of OCR parameters to tweak. When designing test cases, it's best to use other mechanisms to gather information whenever possible; ReadText() should be considered a last resort when more reliable methods are unavailable. When I've used it, I've often needed a lot of trial and error to find the right set of settings and SearchRectangle to get consistent results. Without seeing exactly what images you are trying to read text from, it's difficult, if not impossible, to troubleshoot where the issue might be.
One thing that does stand out to me is that you're trying to read strings that may contain underscores. ReadText() has an optional property, IgnoreUnderscores, which treats underscores as spaces. It defaults to ON because some OCR engines have problems identifying underscore characters consistently.
If you want ReadText() to handle underscores, you'll need to explicitly set this property to OFF:
ReadText(rect, validCharacters:chars, ignoreUnderscores:OFF)

Recognize letters written with symbols

I have a bit of an unorthodox question and cannot think of an approach for tackling it. I have some letters written like this:
/\ |---\ /---\
/ \ |___/ |
/----\ | \ |
/ \ |___/ \---/
Now, the idea is to read this content (possibly from a text file) and parse it into the real letters it represents, so this example should be parsed to ABC.
I understand this is not OCR, but I have no idea whether something like this is possible. I am not asking for a solution, but rather: how would you best attack this problem? What would be a good criterion for deciding where a 'letter' starts and where it ends?
Based on the comments, it sounds like you could store a character font map (a 2-dimensional array for each character), then read the input file and buffer a number of lines equal to the height of the characters.
Then, for each group of lines you would want to segment the input based on the width of the characters and slide across horizontally, looking for matches against your font map.
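A minimal sketch of that idea, using a made-up two-letter "font" of fixed 4-column-by-3-line cells (a real map would cover the whole alphabet and would probably need fuzzy rather than exact matching):

FONT = {  # hypothetical glyph map; each letter is a 4x3 block of text
    "A": [" /\\ ",
          "/--\\",
          "|  |"],
    "B": ["|-\\ ",
          "|-< ",
          "|-/ "],
}

def read_line_art(lines, cell_width=4):
    height = len(next(iter(FONT.values())))
    assert len(lines) == height, "buffer exactly one row of characters"
    width = max(len(l) for l in lines)
    lines = [l.ljust(width) for l in lines]
    text = []
    for x in range(0, width, cell_width):  # slide across horizontally
        cell = [l[x:x + cell_width] for l in lines]
        match = next((ch for ch, glyph in FONT.items() if cell == glyph), "?")
        text.append(match)
    return "".join(text)

print(read_line_art([" /\\ |-\\ ",
                     "/--\\|-< ",
                     "|  ||-/ "]))  # -> AB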
If you need to support multiple fonts then things get more complicated and you'd benefit more from a neural-net approach to character recognition of sorts.
One important aspect to keep in mind about how OCR typically works is that it takes an arbitrary image and "pixelates" it, generating a much lower-resolution image. In your case you've already got a "pixelated" representation of the image, and all you'd have to do is read in the input and feed it into the rest of the pipeline.
I would still approach this as an OCR-esque problem.
You could first draw the characters onto an image and run it through an available OCR library.
Or you could do it yourself.
Pre-process it by converting vertical and horizontal characters into lines first.
Then, where there are forward and backward slashes, approximate the start and end points of the curve by where they meet the previous horizontal and vertical lines (a different approach would be needed for letters such as 'o' or 'e').
Once you have this image, a simple pattern-analysis approach such as naive Bayes should be able to produce reliable results.
Whether the pre-processing would actually improve accuracy, I'm not sure.

How do I design a heuristic for matching translated sentences?

Summary
I am trying to design a heuristic for matching up sentences in a translation (from the original language to the translated language) and would like guidance and tips. Perhaps there is a heuristic that already does something similar? So given two text files, I would like to be able to match up the sentences (so I can pick out a sentence and say this is the translation of that sentence).
Details
The input text would be translated novels, so I do not expect the translations to be literal, although using something like Google Translate might be a good way to test the accuracy of the heuristic.
To help me, I have a library that will gloss the contents of the translated text and give me the definitions of the words in the sentence. Other things I know:
Chapters and order are preserved; I know that the first sentence in chapter three will match the first sentence of chapter three in the translation (note: this is not strictly true; the first sentence might match up with the first two sentences, or even the second sentence).
I can calculate the overall size (characters, sentences, paragraphs), which could give me an idea of the average difference in sentence size (for example, the translation might be 30% longer).
Looking at some books I have, the translated version has about 30% more sentences than the original text.
Implementation
(if it matters)
I am planning to do this in Java, but I am not that fussed; any language will do.
I am not greatly concerned about speed.
I guess, to be sure of the matches, some user feedback might be required, like saying "Yes, this sentence definitely matches that sentence." This would give the heuristic more ground to stand on, but it would mean that the user needs a little proficiency in both languages.
Background
(for those interested)
The reason I want to make this is that I want it to assist with my foreign language study. I am studying Japanese and find it hard to find "good" material (where "good" is defined by what I like). There are already tools to do something similar with subtitles from videos (an easier task - using the timing information of the video). But nothing, as far as I know, for texts.
There are tools called "sentence aligners" used in NLP research that do exactly what you want.
I advise hunalign:
http://mokk.bme.hu/resources/hunalign/
and MS sentence aligner:
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
Both are quite good, but remember that nothing is perfect. Sentences that are too hard to align will be dropped, and some sentences may be wrongly aligned.
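Such tools build on length-based dynamic programming in the spirit of Gale & Church (1993). A toy sketch of that core idea (real aligners use a proper probabilistic cost plus dictionary cues; the 1.3 ratio here just mirrors the "30% longer" observation above, and the sketch assumes an alignment of 1-1/1-2/2-1 beads exists):

def align(src, tgt, ratio=1.3):
    # cost of pairing a group of source sentences with a group of targets,
    # scored purely by character-length mismatch
    def cost(s_group, t_group):
        s_len = sum(len(s) for s in s_group)
        t_len = sum(len(t) for t in t_group)
        return abs(s_len * ratio - t_len)

    INF = float("inf")
    n, m = len(src), len(tgt)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 2), (2, 1)]  # 1-1, 1-2 and 2-1 beads
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in moves:
                if i + di <= n and j + dj <= m:
                    c = best[i][j] + cost(src[i:i + di], tgt[j:j + dj])
                    if c < best[i + di][j + dj]:
                        best[i + di][j + dj] = c
                        back[i + di][j + dj] = (i, j)
    # walk back from (n, m) to recover the aligned pairs
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(pairs))

for s, t in align(["One.", "Two two."], ["Un.", "Deux,", "deux."]):
    print(s, "<->", t)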

Sanitize pasted text from MS-Word

Here's my wild and wacky pseudo-code. Anyone know how to make this real?
Background:
This dynamic content comes from CKEditor, and a lot of folks paste Microsoft Word content into it. No worries: if I just render the attribute untouched, it loads pretty. But the catch is that I want it abbreviated to just 125 characters. When I add truncation, all of the Microsoft Word markup starts popping up. Then I added simple_format, sanitize, and truncate, and even made my controller gsub out specific strings that Word inserts, but there are too many of them, and it seems like an awfully messy way to accomplish this. Since the content is clean by itself, I thought: why not just slice it? However, the Microsoft Word text becomes blank but still holds its numbered positions in the string. So I came up with this (probably awful) solution below.
It's in three steps.
1. When the text is parsed, none of the MS Word junk is displayed, but it still occupies positions in the string. So I want to use a regexp to find the first actual character.
2. Find that character's numbered position in the total string.
3. Use a slice statement to cut from there.
def about_us_truncated
  # Find the index of the first real letter, skipping the invisible junk
  y = self.about_us.index(/[A-Za-z]/) || 0
  self.about_us[y, 125]
  # equivalent one-liner: self.about_us[/[A-Za-z].{0,124}/m]
end
The only other idea I've got is a regex statement that explicitly slices only actual characters, like so:
about_us([a-zA-Z][0..125]), but that is definitely not how it is written.
Here is some sample text of MS Word junk :
&Lt;! [If Gte Mso 9]>&Lt;Xml>&Lt;Br /> &Lt;O:Office Document Settings>&Lt;Br /> &Lt;O:Allow Png/>&Lt;Br /> &Lt;/O:Off...
You haven't provided much information to go off of, but don't be too leery of trying to build this regex on your own before you seek help...
Take your sample text and paste it in Rubular in the test string area and start building your regex. It has a great quick reference at the bottom.
Stumbled across this: http://gist.github.com/139987. It looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best possible one you can find.
To keep MS Word markup out, you should use CKEditor's built-in MS Word sanitizer, because writing a regex for it can get very complicated and you can very easily break tags in half and destroy your site with it.
What I did as a workaround is force paste as plain text in CKEditor.
