Parsing PDF files - parsing

I'm finding it difficult to parse a pdf file that's created in a non-english language. I used pdfbox and itext but couldn't find anything in there that could help parse this file. Here's the pdf file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The pdf says that it's created use LaTeX and Tikkana font. I have Tikkana font installed on my machine, but that didn't help. Please help me in this.
Thanks, K

When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There's no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to PDF by itself using the same LaTeK/GhostScript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted.
Get the glyph name for the given byte based on the existing encoding array
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...

Related

Gibberish table output in tabula-java for Japanese PDF but works in standalone Tabula

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no no error/warning messages. It does seem that the content of the PDF is processed though.
When using the standalone Tabula tool, the characters are encoded properly:
Searching online in the tabula-py and tabula-java documentation, and below are suggestions I could find, but these don't change the output.
Setting the -Dfile.encoding=utf8 (in java call to tabula-py or tabula-java)
Setting chcp 65001 (in Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
The text like any PDF is written in authors random order so for example the 1st PDF body Line (港区内認可保育園等一覧) is the 1262nd block of text added long after the table was started. To hear written order we can use Read Aloud, to verify character and language recognition but unless the PDF was correctly tagged it will also jump from text block to block
So internally the text is rarely tabular the first 8 lines are
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need text extractors that work in a grid like manner or convert the text layout into a row by row output.
This is where all extractors will be confounded as to how to output such a jumbled dense layout and generally ALL will struggle with this page.
Hence its best to use a good generic solution. It will still need data cleaning but at least you will have some thing to work on.
If you only need a zone from the page it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good but could possibly be better by use pdftotext -layout and adjust some options to produce amore regular order.
Your Question
the difference in encoding output?
The Answer
The output from pdf is not the internal coding, so the desired text output is UTF-8, but PDF does not store the text as UTF-8 or unicode it simply uses numbers from a font character map. IF the map is poor everything would be gibberish, however in this case the map is good, so where does the gibberish arise? It is because that out part is not using UTF-8 and console output is rarely unicode.
You correctly show that console needs to be set to Unicode mode then the output should match (except for the density problem)
The density issue would be easier to handle if preprocessed in a flowing format such as HTML
or using a different language

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that pdf actually contain glyph information, and not words. If so, all the programs that extract text from pdf (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output location for each single character. Converting to svg essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters, if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a pdf parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paying option, and replacing my whole infrastructure to handle just one limit case seems a big overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say XML format. Provides the most information. But my notes indicate that it will apply the font-metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.

How can these strings be different?

I am facing a weird problem.
I have extracted data from an Excel file. It should contain an IBAN account number.
Then I tried to analyze the set of account numbers (which the source guarantees to be good) with a Java library.
To keep the scope of the question narrow, I can't explain the following. The below strings are different
030​69
03069
The first is a copy & paste from the Excel file, the second is handwritten. Google returns different results for abi [above number] and in fact in the second case I can find that it is the bank code for Intesa Sanpaolo bank (exact page displaying the ABI code, localized, here).
So, to keep the scope narrow: how is that possible? Is it something to do with the encoding?
Try it yourself: do CTRL+F and try type "030", it will select both lines. Now type 6, it will match only the 2nd line.
Same happened in Notepad++
There's an U+200B ZERO WIDTH SPACE in between 030 and 69 in the first text.
Paste the text in https://www.branah.com/unicode-converter for example, or edit in a hexadecimal capable editor.
The solution for cleaning such strings could be for example to whitelist characters, so replace everything that isn't A-Z0-9 will be scrubbed.

How to scan words, lines and their properties on the text?

I'need to scan a document. It's not OCR, let me show you:
--Example--
Table of Contents
Some Italic Words
Sentence 23
--End--
Suppose that as a ".doc" formatted text. I need to scan it line by line and understand the first line is bold, second is italic and third one includes space after first word and followed by a number. Reason i want to recognize them is i need to categorize them in a table view like bold lines italics, numbereds etc.
I'm okay in both swift and objective-c but totally clueless about document scanning. If you offer any reference, framework or approach i would be grateful to hear.
variant: your doc is really a docx. (docx is xml) Parse the XML. The format defines XML tags it uses to mark stuff bold or italic or whatever -- a docx is kind of like html.
variant: If your doc is really a doc! then we are not talking about xml but a binary format. It is also document and you can go parse it but I don't think will be easy
BUT
There is a library I know: doc2text that can parse a lot of stuff. (http://www.textlib.com/doc2text.html)
We used in past projects and it did an okay job and using this saves you A LOT of effort writing your own parsers

How to correctly display Japanese RTF Fonts

I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.
Typically, English and Japanese is mixed and is mostly displayed without a problem, for example:
Inventory turns partnerships. 在庫回転率の
(my apologies if the Japanese text is broken incorrectly - I do not speak or read the language).
Quite frequently however, only the Japanese portion of the text will be gibberish, for example:
ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優 先順位と彼らに販売する知識)
From extensive online searching, it appears that the problem is as a result of the fonts saved as part of the RTF. Fonts present on Japanese language version of Windows is not necessarily the same as a US English version. It is possible to programmatically replace the fonts in the RTF file which yields an almost acceptable result, i.e.
-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB
However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:
-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?
Clearly, the Unicode characters are rendered correctly, but for example the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.
Is there a generic, (relatively) foolproof way to take RTF containing Eastern Languages and reliably displaying it again?
For completeness sake, I updated the RTF font table in the following way:
Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
Replacing font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Update: Updating font names alone wont make a difference. The locale seems to be the big problem. I have seen a few site discussing ways around converting the display of Japanese RTF to something most reader would handle, but I haven't found a solution yet, see for example:
here and here.
My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether RTF you import is encoded Shift-JIS or Unicode, and also whether the machine you are running on (and therefore D2009 default input format) is Japanese or not. In Japan, if a text file has no Unicode BOM it would usually be Shift-JIS (but not always).
I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.
What I did was to run the entire RTF text through a conversion routine that swapped all characters over ascii 127 to their special unicode point equivalent. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file no problem. Ironically, it re-encoded the \uc1\uxxx? back to their RTF escaped equivalents.
Private Function ConvertRtfToUnicode(ByVal value As String) As String
Dim ch As Char() = value.ToCharArray()
Dim c As Char
Dim sb As New System.Text.StringBuilder()
Dim code As Integer
For i As Integer = 0 To ch.Length - 1
c = ch(i)
code = Microsoft.VisualBasic.AscW(c)
If code <= 127 Then
'Don't need to replace if one of your typical ASCII codes
sb.Append(c)
Else
'MR: Basic idea came from here http://www.eggheadcafe.com/conversation.aspx?messageid=33935981&threadid=33935972
' swaps the character for it's Unicode decimal code point equivalent
sb.Append(String.Format("\uc1\u{0:d}?", code))
End If
Next
Return sb.ToString()
End Function
Not sure if that will help your problem, but it's working for me.

Resources