TCPDF wrong Arabic Character Display with Sakkal Majalla Font - tcpdf

I am using TCPDF for many years. Recently I had to work on Arabic language display. The client wanted SakkalMajalla font (available in Windows/font) and I converted this using TCPDF tool. The conversion process was successful without error.
Now, I am facing a little issue and could not solve it since last 2 months. One of the special characters (called tanween) is placed at the bottom of the preceding character whereas it should be on top.Everything else is working fine but little thing (ٍ
) displayed at wrong place changes the meaning of the word.
يمنع استخدام الهاتف الجوال داخل صالة الاختبار
منعاً باتاً
(I can not upload image as I need 10 reputation points for that, but please notice the little thing on top of this letter تاً. Here, it is displaying properly, but in the pdf it displays at the bottom of the letter.
Is there anyway to edit manually the positioning of this character?
I am searching for the solution for the last 2 months. I event wrote 2 emails to the author of TCPDF Nicolas, but he did not give any response.
Please help.

Even though the font conversion process appeared to work successfully, you should double-check with a font editor (like FontForge) to check that the character is actually encoded correctly in the converted font file.
I have found, after many years of trying to convert all sorts of non-Latin fonts from one format to another, that the most reliable solution for font conversion is this site:
http://www.xml-convert.com/en/convert-tff-font-to-afm-pfa-fpdf-tcpdf

Related

How can these strings be different?

I am facing a weird problem.
I have extracted data from an Excel file. It should contain an IBAN account number.
Then I tried to analyze the set of account numbers (which the source guarantees to be good) with a Java library.
To keep the scope of the question narrow, I can't explain the following. The below strings are different
030​69
03069
The first is a copy & paste from the Excel file, the second is handwritten. Google returns different results for abi [above number] and in fact in the second case I can find that it is the bank code for Intesa Sanpaolo bank (exact page displaying the ABI code, localized, here).
So, to keep the scope narrow: how is that possible? Is it something to do with the encoding?
Try it yourself: do CTRL+F and try type "030", it will select both lines. Now type 6, it will match only the 2nd line.
Same happened in Notepad++
There's an U+200B ZERO WIDTH SPACE in between 030 and 69 in the first text.
Paste the text in https://www.branah.com/unicode-converter for example, or edit in a hexadecimal capable editor.
The solution for cleaning such strings could be for example to whitelist characters, so replace everything that isn't A-Z0-9 will be scrubbed.

Auto page break in libHaru PDF

I'd like to add an automatic page break to a libHaru PDF in iOS.
I do have several text fields in the app which contain the user filled data. when i generate the pdf i first measure the expected size of the text-rect going to be created. if it exceeds the remaining space i trigger a hpdf_new_page event and put the text on an new page. i'd like to have this just in part automatically. so if the text exceeds the space on the current page it should split and continue on a new page without me checking or doing anything.
unfortunately i can't find anything like this in the documentation.
Line counting using fgets() may help. When your print program opens a file to print, each line can be copied to the pdf file and checked for a form feed character
or
if the line count has reached a limit.
Another possible solution is to use a character count limit with "while(getc(file) != EOF)".
This link uses libharu to print basic text files with PCL commands to change the font.
https://github.com/DaDaDadeo/GetCycle/blob/master/pcl_to_pdf.c
The form feed character '\f' (ascii 12) and 61 lines will trigger a new page. There are other conditions in the program to restrict a new page but the general idea is illustrated.
The results are the same as a printer using telnet raw 9100 protocol. The pcl commands are limited to just a couple of font changes so it is not too complicated.
Libharu is rather low-level library, and I could not even expect of appearing such automatic page splitting in newer versions due to number of reasons. Hereafter I state two of them:
There is no good, preferred strategy how to place remaining of non-fitting text on the next page. In some cases it could be even impossible at all.
There is no good, preferred strategy for text splitting.
Why?
Consider your font is extremely large, and just one letter (for instance, wide one as "W") does not fit into the page. Where we are supposed to place it? On the next page? Ok, we add new page... oops, it does not fit this page too - as soon as all our pages have the same size. Dead-end without any good, straightforward way out.
In other words, there should be a user-defined strategy for these cases. Almosy every naive implementation will have such a corner cases.
libharu does not know where it should split your text automatically. It does not know hyphenation rules of your language, it does not know whether it should respect spaces or not (wrap whole words only or not), and so on. It's up to you to specify these rules.
So, you should call HPDF_Font_MeasureText for some part of your text string, decide if it fits into your page (excluding margins, footers - which also out of libharu's internal knowledge) and render it. And note that there is no simple formula for text size depending on its length. String "wwww" is more than twice wider than "iiii", of course if your font is not mono-spaced.

Seemingly unsupported unicode character in UILabel

I have a UILabel that displays a string coming in from a web service. It seems to be properly displaying some unicode characters, but not all. The string comes from the web service in a JSON object as follows:
"\u2b51 \u2605 Special Chars"
This is displayed in the UILabel like so:
Clearly, it's displaying the \u2605 character just fine but not the \u2b51 character. The font is Helvetica Neue--the system font.
Am I doing something wrong or is this a bug in iOS and/or the font?
This seems to be purely a font issue. The character U+2605 BLACK STAR “★” is relatively common in fonts, so it is probably taken from a system font or a fallback font. The character U+2B51 BLACK SMALL STAR “⭑” is relatively rare; it was added in Unicode 5.1, i.e. rather recently (in the character code world, that is). According to Fileformat.info data, it appears in Code2000, FreeSerif, GNU Unifont, Quivira, STIX, STIXMath, and Symbola. Not much; most computers have none of them (though many Linux systems probably have FreeSerif). Well, it seems that you can add Asana Math and Universalia to the list; still rather limited.

Parsing PDF files

I'm finding it difficult to parse a pdf file that's created in a non-english language. I used pdfbox and itext but couldn't find anything in there that could help parse this file. Here's the pdf file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The pdf says that it's created use LaTeX and Tikkana font. I have Tikkana font installed on my machine, but that didn't help. Please help me in this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There's no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to PDF by itself using the same LaTeK/GhostScript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted.
Get the glyph name for the given byte based on the existing encoding array
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...

What is the best way to produce a tilde in LaTeX for a website?

Following the previous questions on this topic, when you produce a website in LaTeX what is the best way to produce a url that contains a tilde? \verb produces the upper tilde that does not read well, and $\sim$ does not copy/pase well (adding a space when I do it). Solutions?
It seems like this should be one of those things that has a very easy fix... if it doesn't, why not?
I'd look at the url package.
I know this is an old question, but I recently came up with something that, despite a severe lack of elegance, works beautifully.
\catcode`~=11 % make LaTeX treat tilde (~) like a normal character
\newcommand{\urltilde}{\kern -.15em\lower .7ex\hbox{~}\kern .04em}
\catcode`~=13 % revert back to treating tilde (~) as an active character
Now you can use \urltilde inside of a \url tag (even in a .bib file) and:
1) the URL will render perfectly;
2) clicking on the URL will take you to the correct address; and,
3) copy-paste will put the correct address in the clipboard.
This is the only solution I have found that satisfies all three of these requirements. I hope it helps somebody out there.
url package didn't work for me. hyperref does the job.
\usepackage{hyperref}
\url{http://website.com/~username/some_stuff/}
I think it is better to use URL encoding in such a case (see, e.g., http://www.blooberry.com/indexdot/html/topics/urlencoding.htm).
It means replacing the tilde in the link with %7E.
Maybe it does not look so good in the final document (readers will see %7E instead of the tilde), but at least the copy-paste functionality works for sure, which I think is the most important thing.
For instance, for the link www.example.com/~someuser/somepage.htm I use the following code:
{\tt http://www.example.com/\%7Esomeuser/somepage.htm}
PS: The same applies for all links with white spaces or any other special characters.
I think $_{\widetilde{~}}$ works good for the tilde issue.
I want to suggest using %7e
\tt{http://example.com/\%7etest}
tt is for making it monospace.
It looks a bit different, but it allows copy-and-paste.
\symbol{126} would be another way, but in the default font it also yields a superscripted tilde. An ugly hack (but what isn't in LaTeX) would be to use
${}_{\textrm{\symbol{126}}}$
which produces a text tilde in Math mode and subscripts it. So it appears in the middle of the line. Seems to work for a clickable link as well. You can always put that into a command on its own :)
I'm not a latex user admittedly, but does this page help?
http://www.cse.wustl.edu/~mgeorg/html/tildalatex.html
They do the following:
\def\urltilda{\kern -.15em\lower .7ex\hbox{\~{}}\kern .04em}
\def\urldot{\kern -.10em.\kern -.10em}
\def\urlhttp{http\kern -.10em\lower -.1ex\hbox{:}\kern -.12em\lower 0ex\hbox{/}\kern -.18em\lower 0ex\hbox{/}}
The way this is used is
{\tt mgeorg#cse\urldot wustl\urldot edu}
{\tt \urlhttp www\urldot cse\urldot wustl\urldot edu/\urltilda mgeorg}
After wasting a lot of time on a related problem with LaTeXing a tilde, I thought I should record my results here in case it is a help to anyone else.
tldr: To avoid some of the difficulties of typesetting a proper tilde ~ character, I recommend adding
\usepackage[T1]{fontenc}
in the preamble of your latex file to get more modern versions of font encoding. On modern TeX installations, this automatically loads the T1-encoded version of Knuth's Computer Modern fonts.
Details:
As has been noted by previous posters, using standard pdflatex/latex with the default fonts (the original OT1-encoded Computer Modern), there are difficulties in trying to render a regular tilde character, ~, and there are various workarounds with varying degrees of satisfaction. One workaround is to use the textasciitilde command. For example, you can use it to put a tilde in a URL, like:
https://w3.pppl.gov/\textasciitilde{}hammett
This gives a raised tilde (the kind of tilde intended for accents over another character), and methods to lower it to a more normal tilde are discussed by other posters. While this works as expected if the resulting PDF file is viewed in Adobe Acrobat, a downside is that if the PDF is viewed with the built-in Preview app on a Mac (or if viewed in TeXShop's previewer, which uses the same Mac libraries for rendering), then this URL link visually looks correct as https://w3.pppl.gov/~hammett, but if you actually click on it, it gets interpreted only as
https://w3.pppl.gov/
If you try to cut and paste the whole URL from Preview into a browser, the tilde in the pdf is replaced with a blank
https://w3.pppl.gov/%20hammett
so it doesn't work. (If you paste into some browsers, there are other special characters added also). (The \url{...} command will display a URL with tildes correctly, but forces it to use a fixed-space terminal font, and there are times when you want to use a tilde somewhere besides in a URL.)
Investigating further, I learned that the character that latex puts into the pdf file is not a regular tilde character but a "Combining Tilde"
COMBINING TILDE Unicode: U+0303, UTF-8: CC 83
(previously known as a "non-spacing tilde", indicating its usage for accents over another character) and that is what Mac's Preview renders. This means that cut-and-paste from Mac's Preview into a browser fails.
However, Adobe Acrobat somehow implemented a workaround and when displaying the pdf converts this into a regular "Tilde"
TILDE Unicode: U+007E, UTF-8: 7E
so cut-and-paste of a URL with a tilde from an Adobe-displayed pdf file into a web browser works fine.
(I determined these unicode values by copying the rendered tilde character from the TeXShop Preview window and pasting it into the Character Viewer app in the menu bar on a Mac. Then right-click on the character to select "Copy Character Info". The Character Viewer app can be turned on in Mac Preferences, Keyboard, and select "Show keyboard and emoji viewers in menu bar".)
This problem goes away if you use T1-encoded fonts by adding
\usepackage[T1]{fontenc}
to your latex file preamble. Furthermore, if you use the Adobe Times-like font
\usepackage{newtxtext}
then tilde is rendered at mid height automatically and doesn't need to be lowered. (The above command uses the newtx font only for regular text, while leaving the math fonts unchanged. I like newtxtext because it is slightly darker than CM. But for math I prefer to keep Knuth's traditional Computer Modern CM font, because the newtxmath font, like other Times math fonts, renders math italic v in way that is confusingly like a Greek nu.)
P.S.: For more info on why it's always good to use the fontenc package for a more modern approach than the 1970's original encoding, see (1) https://www.texfaq.org/FAQ-why-inp-font, (2) https://tex.stackexchange.com/questions/664/why-should-i-use-usepackaget1fontenc, and (3) https://en.wikibooks.org/wiki/LaTeX/Fonts.

Resources