I have an MVC application in which I am generating PDF from HTML pages with Rotativa. In the HTML I display some strings which I take from the resources of my application. When they are displayed as simple HTML, all the strings look good, but when the conversion to PDF is made, the exponential values are not formatted properly.
For exponents less than 4, everything looks fine (for example in²), but when I try to display exponents of 4 or higher (in⁴), the output is altered: I get a tilde (~) instead of the expected number. I assume this is because of the character set supported by Rotativa.
Is it possible to make Rotativa display exponential values higher than 3?
NOTE: I don't want to use <sup> x </sup> as it does not solve the problem of strings retrieved from resources.
I have tried changing the UTF encoding and the font styles, but nothing worked.
I finally managed to solve this with a workaround. After asking this question I worked out how to convert the superscript exponent characters to plain digits, and then wrapped them in <sup>exponent</sup> when displaying the values from resources.
Not so pretty, but it's still working!
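For anyone landing here, a minimal sketch of that workaround. It is written in Python purely for illustration (the original application is ASP.NET MVC), and the function name superscripts_to_sup_tags is my own, not part of Rotativa. It assumes the resource strings are otherwise safe to emit as HTML.

```python
import re

# Unicode superscript digits mapped to plain ASCII digits.
SUPERSCRIPT_DIGITS = {
    "\u2070": "0", "\u00b9": "1", "\u00b2": "2", "\u00b3": "3", "\u2074": "4",
    "\u2075": "5", "\u2076": "6", "\u2077": "7", "\u2078": "8", "\u2079": "9",
}

# Matches one or more consecutive superscript digits.
_SUPERSCRIPT_RUN = re.compile("[" + "".join(SUPERSCRIPT_DIGITS) + "]+")


def superscripts_to_sup_tags(text: str) -> str:
    """Replace each run of superscript digits with <sup>digits</sup>."""
    def replace(match: "re.Match[str]") -> str:
        digits = "".join(SUPERSCRIPT_DIGITS[ch] for ch in match.group())
        return "<sup>" + digits + "</sup>"

    return _SUPERSCRIPT_RUN.sub(replace, text)


print(superscripts_to_sup_tags("in\u2074"))  # in<sup>4</sup>
```

The same idea should port directly to C#: a lookup table for the ten superscript digits plus a regex replace before the view is rendered.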
Related
I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no error or warning messages. It does seem that the content of the PDF is processed, though.
When using the standalone Tabula tool, the characters are encoded properly:
I searched online and in the tabula-py and tabula-java documentation. Below are the suggestions I could find, but they don't change the output.
Setting the -Dfile.encoding=utf8 (in java call to tabula-py or tabula-java)
Setting chcp 65001 (in Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
The text, as in any PDF, is written in whatever order the authoring tool chose, so for example the first PDF body line (港区内認可保育園等一覧) is the 1262nd block of text, added long after the table was started. To hear the written order we can use Read Aloud, which also verifies character and language recognition, but unless the PDF was correctly tagged it will also jump from text block to text block.
So internally the text is rarely tabular. The first 8 lines are:
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need text extractors that work in a grid-like manner or convert the text layout into row-by-row output.
This is where extractors get confounded as to how to output such a jumbled, dense layout, and generally all of them will struggle with this page.
Hence it's best to use a good generic solution. It will still need data cleaning, but at least you will have something to work on.
If you only need a zone from the page it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good, but it could possibly be improved by using pdftotext -layout and adjusting some options to produce a more regular order.
Your Question
the difference in encoding output?
The Answer
The output from a PDF extractor is not the PDF's internal encoding. The desired text output is UTF-8, but a PDF does not store text as UTF-8 or Unicode; it simply uses numbers from a font character map. If the map is poor, everything will be gibberish. In this case the map is good, so where does the gibberish arise? It arises because the output stage is not using UTF-8, and console output is rarely Unicode.
You correctly show that the console needs to be set to Unicode mode; then the output should match (except for the density problem).
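To illustrate the point about the output stage, here is a small sketch in plain Python (no Tabula calls involved). It takes one of the strings from the page, shows what correct UTF-8 bytes look like when something downstream decodes them with a legacy code page (cp1252 is just an assumed example), and shows that writing to a file with an explicit encoding sidesteps the console entirely. The file name extracted_utf8.txt is made up for the example.

```python
import tempfile
from pathlib import Path

text = "認可保育園"  # sample string from the extracted lines above

# Correct bytes, wrong decoder: UTF-8 bytes read as code page 1252 look like gibberish.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("cp1252", errors="replace")

# Writing to a file with an explicit encoding avoids depending on the console code page.
out_path = Path(tempfile.gettempdir()) / "extracted_utf8.txt"
out_path.write_text(text, encoding="utf-8")
round_trip = out_path.read_text(encoding="utf-8")

print(round_trip == text)  # the file round-trips cleanly
print(garbled == text)     # the mis-decoded console-style view does not
```

So if the extractor can write straight to a UTF-8 file, that is usually a safer check than judging the result by what the console prints.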
The density issue would be easier to handle if preprocessed in a flowing format such as HTML
or using a different language
I have been using TCPDF for many years. Recently I had to work on Arabic language display. The client wanted the SakkalMajalla font (available in Windows/font), and I converted it using the TCPDF tool. The conversion process completed without errors.
Now I am facing a small issue that I have not been able to solve for the last 2 months. One of the special characters (called tanween) is placed at the bottom of the preceding character, whereas it should be on top. Everything else is working fine, but this little mark (ٍ) displayed in the wrong place changes the meaning of the word.
يمنع استخدام الهاتف الجوال داخل صالة الاختبار
منعاً باتاً
(I cannot upload an image as I need 10 reputation points for that, but please notice the little mark on top of the letter in تاً. Here it is displayed properly, but in the PDF it is displayed at the bottom of the letter.)
Is there any way to manually edit the positioning of this character?
I have been searching for a solution for the last 2 months. I even wrote 2 emails to Nicolas, the author of TCPDF, but he did not respond.
Please help.
Even though the font conversion process appeared to work successfully, you should double-check with a font editor (like FontForge) to check that the character is actually encoded correctly in the converted font file.
I have found, after many years of trying to convert all sorts of non-Latin fonts from one format to another, that the most reliable solution for font conversion is this site:
http://www.xml-convert.com/en/convert-tff-font-to-afm-pfa-fpdf-tcpdf
We are using Jfreechart along with iText for generating pdf reports. For Japanese, we realized that in the rendered content for the graph legend, characters don't have any spaces between them. They basically overlap which makes it hard to read.
Do we need to use any special encoding?
Attached are images for expected and actual(generated by jfreechart), in that order
Below is a snippet of the graph generated with the legend
According to the PDF specification, a CIDFont dictionary contains an optional integer entry called DW and an optional array called W. DW is the default width for glyphs. If not set, it defaults to 1000.
The W array describes individual widths for characters in the font (if not specified they default to the value of DW). For many Japanese fonts, I've seen the value set to lower than 1000, but in this case it might be too low.
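As a rough sketch of how DW and W combine, the snippet below expands a W array into per-CID widths. It assumes the array has already been read out of the PDF into plain Python lists (the function names are my own), and it handles the two forms the specification allows: a starting CID followed by a list of widths, and a first/last CID pair followed by one shared width.

```python
def expand_cid_widths(w_array):
    """Expand a CIDFont W array into a {cid: width} dictionary."""
    widths = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        if isinstance(w_array[i + 1], list):
            # Form 1: c [w1 w2 ... wn] gives widths for consecutive CIDs from c.
            for offset, width in enumerate(w_array[i + 1]):
                widths[first + offset] = width
            i += 2
        else:
            # Form 2: c_first c_last w gives one width for the whole range.
            last, width = w_array[i + 1], w_array[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


def glyph_width(cid, widths, default_width=1000):
    """Fall back to DW (1000 when the entry is absent) for CIDs not listed in W."""
    return widths.get(cid, default_width)


widths = expand_cid_widths([120, [400, 325, 500], 7080, 7082, 1000])
print(glyph_width(121, widths), glyph_width(5, widths))
```

If most of the legend's characters fall through to a DW that is much smaller than the glyphs really are, overlapping text is the kind of symptom you would expect.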
You can take a look at these values using Acrobat's "preflight>browse internal structure" tool. If these seem off, you may be using the wrong encoding. Setting encoding to "UniJIS-UCS2-H" should help resolve this issue.
I'm finding it difficult to parse a pdf file that's created in a non-English language. I used pdfbox and itext but couldn't find anything in there that could help parse this file. Here's the pdf file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The pdf says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.
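To make the structure of a differences array concrete, here is a small sketch that expands one into a code-to-name mapping. It assumes the array has already been tokenised into Python ints and name strings (the function name is mine). Each number sets the current code, and every name that follows takes the next consecutive code.

```python
def parse_differences(tokens):
    """Expand a Differences array into a {code: glyph_name} dictionary."""
    mapping = {}
    code = None
    for token in tokens:
        if isinstance(token, int):
            # A number resets the code that the next name applies to.
            code = token
        else:
            if code is None:
                raise ValueError("Differences array must start with a number")
            mapping[code] = token.lstrip("/")
            code += 1  # consecutive names map to consecutive codes
    return mapping


# The visible part of the array quoted above.
print(parse_differences([47, "/BB", 61, "/BP", "/BQ", 81, "/C6"]))
# {47: 'BB', 61: 'BP', 62: 'BQ', 81: 'C6'}
```

A text extractor then tries to turn names like BB or BP into Unicode, and since those are not standard glyph names, it has nothing sensible to map them to.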
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
Render it to PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
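The lookup half of that workflow can be sketched as below. Everything here is hypothetical scaffolding: it assumes you have already pulled the CharProc streams out of both the reference PDFs and the target PDF as raw bytes, and that identical glyphs produce byte-identical streams (in practice you might need to normalise whitespace or number formatting first). Using a dictionary keyed by the stream is the caching mentioned in the note.

```python
def build_reference_table(reference_charprocs):
    """Invert {known_character: charproc_stream} into {charproc_stream: known_character}."""
    return {stream: char for char, stream in reference_charprocs.items()}


def decode_text_bytes(text_bytes, code_to_glyph_name, charprocs, reference_table,
                      unknown="\ufffd"):
    """Translate raw text bytes into characters by matching CharProc streams."""
    decoded = []
    for byte in text_bytes:
        # Glyph name for this byte, from the font's existing encoding array.
        glyph_name = code_to_glyph_name.get(byte)
        # CharProc stream that draws that glyph in the target PDF.
        stream = charprocs.get(glyph_name)
        # Known character whose reference stream is identical, if any.
        decoded.append(reference_table.get(stream, unknown))
    return "".join(decoded)


# Toy data standing in for real extracted streams.
reference = build_reference_table({"A": b"stream-for-A", "B": b"stream-for-B"})
encoding = {47: "BB", 61: "BP"}
target_charprocs = {"BB": b"stream-for-A", "BP": b"stream-for-B"}
print(decode_text_bytes(bytes([47, 61, 99]), encoding, target_charprocs, reference))
```

Bytes with no matching stream come out as U+FFFD, which makes it easy to see how much of the font you still have to cover.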
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
maybe a way to batch convert also?
You could use Google Docs API to upload and convert .doc's.
http://code.google.com/apis/documents/overview.html
Some samples and code: http://code.google.com/apis/documents/code.html
Ruby example and demo:
http://code.google.com/p/gdata-samples/source/browse/#svn/trunk/doclist/DocListManager
http://doclistmanager.googlecodesamples.com/
The short answer is no, but the long answer is sorta.
MS Word itself will save a file out as HTML, but it's a total friggin' mess. To an extent this is simply because the customers who convert Word files to HTML directly are not concerned about it being sloppy, so Word hasn't worked hard on making a clean output. On the other hand, it's intrinsically difficult, because Word is oriented towards creating fixed-size, non-dynamic documents, like a paper-based book. So it's easy to convert to other static formats (say a PDF), but how do you convert to HTML? Do you just make the text flow across? Do you set a width that will hopefully make the layout stay the same? What if there are fonts or layout elements in the Word doc that are not available in the HTML renderer?
The easiest thing to do is to do it project by project. You can create a DTD to convert an RTF file, for instance, but this involves you making programmer-level decisions about how these will be converted.