Overlapping Japanese content rendered by Jfreechart in pdf - localization

We are using JFreeChart along with iText to generate PDF reports. For Japanese, we noticed that the characters in the rendered graph legend have no space between them. They overlap, which makes the legend hard to read.
Do we need to use a special encoding?
Attached are images of the expected and actual (generated by JFreeChart) output, in that order.
Below is a snippet of the generated graph with its legend.

According to the PDF specification, a CIDFont dictionary contains an optional entry called DW and an optional array called W. DW is the default width for glyphs. If not set, it defaults to 1000.
The W array describes individual widths for characters in the font (if not specified, they default to the value of DW). For many Japanese fonts, I've seen the value set lower than 1000, but in this case it might be too low.
You can inspect these values using Acrobat's "Preflight > Browse Internal Structure" tool. If they seem off, you may be using the wrong encoding. Setting the encoding to "UniJIS-UCS2-H" should help resolve this issue.
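To make the DW/W relationship concrete, here is a minimal Python sketch of how a reader might resolve the width of one CID. The function name `cid_width` and the sample array are made up for illustration; the two entry forms (`c [w1 w2 ...]` and `c_first c_last w`) are the ones the PDF specification defines for W.

```python
def cid_width(cid, w_array, dw=1000):
    """Return the width of `cid` from a CIDFont /W array, falling back to /DW."""
    i = 0
    while i < len(w_array):
        first = w_array[i]
        if isinstance(w_array[i + 1], list):
            # Form 1: c [w1 w2 ... wn] lists widths for CIDs c, c+1, ...
            widths = w_array[i + 1]
            if first <= cid < first + len(widths):
                return widths[cid - first]
            i += 2
        else:
            # Form 2: c_first c_last w gives one width for the whole range
            last, width = w_array[i + 1], w_array[i + 2]
            if first <= cid <= last:
                return width
            i += 3
    # CID not covered by /W: use the default width
    return dw


# Hypothetical /W array: CIDs 1-2 listed individually, CIDs 231-632 half-width
W = [1, [500, 600], 231, 632, 500]
print(cid_width(2, W))
print(cid_width(300, W))
print(cid_width(7000, W))
```

If the font in the report resolves most Japanese glyphs to a width well below what the glyphs actually need, neighbouring characters will be placed too close together, which would match the overlap described in the question.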

Related

Gibberish table output in tabula-java for Japanese PDF but works in standalone Tabula

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human-readable (definitely not Japanese characters), and there are no error or warning messages. It does seem that the content of the PDF is processed, though.
When using the standalone Tabula tool, the characters are encoded properly:
I searched online and in the tabula-py and tabula-java documentation. Below are the suggestions I could find, but they don't change the output:
Setting the -Dfile.encoding=utf8 (in java call to tabula-py or tabula-java)
Setting chcp 65001 (in Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
As in any PDF, the text is written in whatever order the authoring tool chose. For example, the first line of the page body (港区内認可保育園等一覧) is the 1262nd block of text, added long after the table was started. To hear the written order, and to verify character and language recognition, we can use Read Aloud, but unless the PDF was correctly tagged it will also jump from text block to text block.
So internally the text is rarely tabular. The first 8 lines are:
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need a text extractor that works in a grid-like manner, or one that converts the text layout into row-by-row output.
This is where extractors get confounded about how to output such a jumbled, dense layout, and generally all of them will struggle with this page.
Hence it's best to use a good generic solution. The result will still need data cleaning, but at least you will have something to work on.
If you only need one zone of the page, it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good, but could possibly be improved by using pdftotext -layout and adjusting some options to produce a more regular order.
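As a rough illustration of the "row by row" conversion mentioned above, here is a small Python sketch that groups positioned text blocks into rows. It is not what Tabula or pdftotext do internally; the function name, the tolerance value, and the coordinates are all assumptions, and y is taken to increase down the page.

```python
def blocks_to_rows(blocks, y_tolerance=2.0):
    """Group (x, y, text) blocks into rows by y, then order each row by x."""
    rows = []
    for x, y, text in sorted(blocks, key=lambda b: (b[1], b[0])):
        # A block joins the current row if its y is close to the row's first y
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Hypothetical blocks in "written" order, not reading order
blocks = [
    (300, 10.5, "計"),
    (50, 10.0, "0歳"),
    (50, 30.0, "001010"),
    (120, 10.2, "1歳"),
    (120, 30.4, "区立"),
]
rows = blocks_to_rows(blocks)
```

On a dense page the tolerance needs tuning, and cells that wrap over several lines (such as the split phone number above) will still need cleaning afterwards.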
Your Question
the difference in encoding output?
The Answer
The text output from a PDF is not its internal coding. The desired text output is UTF-8, but a PDF does not store text as UTF-8 or Unicode; it simply uses numbers from a font character map. If the map is poor, everything will be gibberish. In this case the map is good, so where does the gibberish arise? It arises in the output stage: that part is not using UTF-8, and console output is rarely Unicode.
As you correctly show, the console needs to be set to Unicode mode; then the output should match (except for the density problem).
The density issue would be easier to handle if the page were preprocessed into a flowing format such as HTML,
or handled using a different language.
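The effect of a mismatched output stage can be reproduced without any PDF at all. The Python sketch below is only an illustration: latin-1 stands in for whatever legacy code page the console might be using (an assumption), and the in-memory stream stands in for a file or pipe opened explicitly as UTF-8.

```python
import io

text = "港区内認可保育園等一覧"

# Correctly extracted text, serialised as UTF-8 bytes
utf8_bytes = text.encode("utf-8")

# The same bytes read by a consumer that assumes a single-byte code page
garbled = utf8_bytes.decode("latin-1")

# Nothing was lost: reversing the wrong decoding restores the text
restored = garbled.encode("latin-1").decode("utf-8")

# Writing through a stream with an explicit UTF-8 encoding avoids the problem
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="utf-8")
writer.write(text)
writer.flush()
```

Redirecting extractor output to a file opened with an explicit UTF-8 encoding, rather than reading it from the console, is one way to rule out the console as the source of the gibberish.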

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters, and then maybe group them into words and lines, so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one edge case seems like serious overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
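To illustrate how glyph positions follow from a run's start position plus font metrics, here is a simplified Python sketch. The function name and the sample widths are hypothetical, and the sketch ignores word spacing, horizontal scaling, kerning adjustments, and the text matrix, all of which a real extractor has to apply.

```python
def glyph_boxes(run_x, font_size, widths, char_spacing=0.0):
    """Return (x_start, x_end) for each glyph in a horizontal text run.

    `widths` are glyph widths in 1/1000 text-space units, the unit used by
    PDF font dictionaries.
    """
    boxes = []
    x = run_x
    for w in widths:
        advance = w / 1000.0 * font_size
        boxes.append((x, x + advance))
        # The next glyph starts after this glyph's advance plus any spacing
        x += advance + char_spacing
    return boxes


# Hypothetical run: starts at x=100, 10 pt font, three glyph widths
boxes = glyph_boxes(100.0, 10.0, [500, 250, 750])
```

A ligature appears here as a single entry in `widths`, which is why the number of glyph boxes can differ from the number of characters in the word.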
The Python package pdfminer has a script, pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.

Exponential number four not recognized in pdf generation with rotativa

I have an MVC application in which I am generating PDF from HTML pages with Rotativa. In the HTML I display some strings which I take from the resources of my application. When they are displayed as simple HTML, all the strings look good, but when the conversion to PDF is made, the exponential values are not formatted properly.
For exponents below 4, everything looks good (for example in²), but when I try to display powers of 4 or higher (in⁴), the output is altered: a tilde ~ appears instead of the expected number. I assume this is because of the character set supported by Rotativa.
Is it possible to make Rotativa display exponential values higher than 3?
NOTE: I don't want to use <sup> x </sup> as it does not solve the problem of strings retrieved from resources.
I have tried changing the UTF encoding and font styles, but nothing worked.
I finally managed to solve this with a workaround. After asking this question, I figured out how to convert the superscript characters to plain digit strings, and then displayed the values from resources using <sup>exponent</sup>.
Not so pretty, but it's still working!
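The workaround can be sketched as follows. This is an illustration in Python rather than the original MVC code, and the function and constant names are made up: runs of Unicode superscript digits in a resource string are replaced by plain digits wrapped in <sup> tags just before the HTML is rendered.

```python
import re

# Unicode superscript digits mapped to their plain ASCII counterparts
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")
SUP_RUN = re.compile("[⁰¹²³⁴⁵⁶⁷⁸⁹]+")


def superscripts_to_sup(text):
    """Replace each run of superscript digits with <sup>digits</sup>."""
    return SUP_RUN.sub(
        lambda m: "<sup>" + m.group().translate(SUPERSCRIPTS) + "</sup>",
        text,
    )


print(superscripts_to_sup("in⁴"))
```

The resource strings themselves stay unchanged; only the string handed to the PDF converter is rewritten. The sketch assumes the rest of the string is already safe to emit as HTML.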

Auto page break in libHaru PDF

I'd like to add an automatic page break to a libHaru PDF in iOS.
I have several text fields in the app that contain user-entered data. When I generate the PDF, I first measure the expected size of the text rect that is going to be created. If it exceeds the remaining space, I trigger a hpdf_new_page event and put the text on a new page. I'd like this to happen partly automatically: if the text exceeds the space on the current page, it should split and continue on a new page without me checking or doing anything.
Unfortunately, I can't find anything like this in the documentation.
Line counting using fgets() may help. When your print program opens a file to print, each line can be copied to the PDF file and checked for a form feed character, or for whether the line count has reached a limit.
This link uses libharu to print basic text files with PCL commands to change the font.
https://github.com/DaDaDadeo/GetCycle/blob/master/pcl_to_pdf.c
The form feed character '\f' (ascii 12) and 61 lines will trigger a new page. There are other conditions in the program to restrict a new page but the general idea is illustrated.
The results are the same as a printer using telnet raw 9100 protocol. The pcl commands are limited to just a couple of font changes so it is not too complicated.
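The pagination rule described above can be sketched independently of libharu. The Python function below is an illustration, not a port of the linked C program: it assumes that either a form feed or a full page of 61 lines starts a new page, and the point where it closes a page is where a real program would add a new libharu page.

```python
def paginate(lines, max_lines=61):
    """Split text lines into pages on form feeds or when a page is full."""
    pages, current = [], []
    for line in lines:
        parts = line.split("\f")
        for i, part in enumerate(parts):
            if i > 0:
                # Every form feed closes the current page
                pages.append(current)
                current = []
            if part or len(parts) == 1:
                if len(current) == max_lines:
                    # Page is full: start a new one before adding the line
                    pages.append(current)
                    current = []
                current.append(part)
    if current:
        pages.append(current)
    return pages


pages = paginate(["a", "b", "\fc", "d"])
```

A character-count limit, as suggested above, would work the same way with a running total of characters in place of `len(current)`.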
libharu is a rather low-level library, and I would not expect automatic page splitting to appear even in newer versions, for a number of reasons. Here are two of them:
There is no good, preferred strategy for placing the remainder of non-fitting text on the next page. In some cases it may even be impossible.
There is no good, preferred strategy for splitting text.
Why?
Suppose your font is extremely large, and just one letter (for instance a wide one such as "W") does not fit on the page. Where are we supposed to place it? On the next page? OK, we add a new page... oops, it does not fit on this page either, since all our pages have the same size. A dead end without any good, straightforward way out.
In other words, there should be a user-defined strategy for these cases. Almost every naive implementation will have such corner cases.
libharu does not know where it should split your text automatically. It does not know the hyphenation rules of your language, it does not know whether it should respect spaces (wrap whole words only) or not, and so on. It's up to you to specify these rules.
So, you should call HPDF_Font_MeasureText for some part of your text string, decide whether it fits on your page (excluding margins and footers, which are also outside libharu's knowledge), and render it. Note that there is no simple formula for text size as a function of length: the string "wwww" is more than twice as wide as "iiii", unless your font is monospaced.
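The measure-then-decide loop can be sketched in Python with a stand-in for the font measurement. Everything here is hypothetical: `measure` plays the role that HPDF_Font_MeasureText plays in libharu, the width table is invented, and the policy (wrap on whole words, let an oversized word occupy its own line) is just one of the user-defined strategies the answer refers to.

```python
# Hypothetical glyph widths in 1/1000 units for a proportional font
WIDTHS = {"w": 722, "i": 278, " ": 278}


def measure(s, font_size=10.0):
    """Stand-in for a real font measurement call."""
    return sum(WIDTHS.get(ch, 500) for ch in s) * font_size / 1000.0


def wrap_words(text, max_width, measure_fn):
    """Greedy whole-word wrapping; an oversized word gets its own line."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if not current or measure_fn(candidate) <= max_width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


lines = wrap_words("wwww iiii wwww", 45.0, measure)
```

The resulting lines can then be counted against the remaining height of the page to decide where a new page has to start.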

PDFKitten is highlighting on wrong position

I am using PDFKitten to search for strings within PDF documents and highlight the results. FastPDFKit or any other commercial library is not an option, so I stuck with the one closest to my requirements.
As you can see in the screenshot, I searched for the string "in", which is always highlighted correctly except for the last occurrence. I have a more complex PDF document where the highlight box for "in" is off by nearly 40%.
I read through the whole code and checked the issue tracker, but apart from line height problems I found nothing regarding the width calculation. For the moment I don't see any pattern in where the calculation goes or could go wrong, and I hope that someone else has had a similar problem.
My current guess is that the coordinates and character widths are calculated incorrectly somewhere in the font classes or RenderingState.m. The project is very complex, and maybe one of you has had a similar problem with PDFKitten in the past.
I have used the original sample PDF document from PDFKitten for my screenshot.
This might be a bug in PDFKitten when calculating the width of characters whose character identifier does not coincide with its unicode character code.
appendPDFString in StringDetector works with two strings when processing some string data:
// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];
// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];
stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a unicode string.
Thus, in spite of the variable's name, cidString is not a sequence of character identifiers but of Unicode characters. Nonetheless, its entries are used as the argument of didScanCharacter, which in Scanner is implemented to advance the position by the character width: it passes the value to widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.
So, if the CID and the Unicode character code don't coincide, the wrong character width is determined and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12, which is very different from its Unicode code 0xfb01.
I would propose enhancing PDFKitten to also define a didScanCID method in StringDetector, which appendPDFString should call next to didScanCharacter for each processed character, forwarding its CID. Scanner should then use this new method instead to calculate the width by which to advance its cursor.
This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types) in spite of the comment expect the argument to be a unicode code after all...
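The suspected failure mode is easy to reproduce in isolation. The Python sketch below is not PDFKitten code: the width table, the CID-to-Unicode map, and the default width of 1000 are invented for illustration. Only the CID 12 and code point 0xfb01 for the fi ligature come from the analysis above.

```python
# Hypothetical font data: widths are keyed by CID, in 1/1000 units
WIDTHS_BY_CID = {12: 556, 105: 278, 110: 556}
CID_TO_UNICODE = {12: "\ufb01", 105: "i", 110: "n"}
DEFAULT_WIDTH = 1000


def advance_by_cid(cids, font_size):
    """Correct cursor advance: look widths up by CID."""
    total = sum(WIDTHS_BY_CID.get(c, DEFAULT_WIDTH) for c in cids)
    return total * font_size / 1000.0


def advance_by_unicode(cids, font_size):
    """Mimics the suspected bug: look widths up by Unicode code point."""
    codes = [ord(CID_TO_UNICODE[c]) for c in cids]
    total = sum(WIDTHS_BY_CID.get(u, DEFAULT_WIDTH) for u in codes)
    return total * font_size / 1000.0


# "in" alone: CID and code point coincide, so both lookups agree
same = advance_by_cid([105, 110], 10.0) == advance_by_unicode([105, 110], 10.0)

# fi ligature followed by "in": the Unicode lookup misses CID 12
right = advance_by_cid([12, 105, 110], 10.0)
wrong = advance_by_unicode([12, 105, 110], 10.0)
```

In this toy setup the cursor ends up too far to the right after the ligature, so every later highlight on the same line is shifted, which is consistent with only the last "in" being misplaced.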
(Sorry if I used the wrong vocabulary here or there, I'm a Java guy... :))

Resources