What is the best way to produce a tilde in LaTeX for a website?

Following the previous questions on this topic: when you produce a website in LaTeX, what is the best way to produce a URL that contains a tilde? \verb produces a raised tilde that does not read well, and $\sim$ does not copy/paste well (it adds a space when I do it). Solutions?
It seems like this should be one of those things that has a very easy fix... if it doesn't, why not?

I'd look at the url package.

I know this is an old question, but I recently came up with something that, despite a severe lack of elegance, works beautifully.
\catcode`~=11 % make LaTeX treat tilde (~) like a normal character
\newcommand{\urltilde}{\kern -.15em\lower .7ex\hbox{~}\kern .04em}
\catcode`~=13 % revert back to treating tilde (~) as an active character
Now you can use \urltilde inside of a \url tag (even in a .bib file) and:
1) the URL will render perfectly;
2) clicking on the URL will take you to the correct address; and,
3) copy-paste will put the correct address in the clipboard.
This is the only solution I have found that satisfies all three of these requirements. I hope it helps somebody out there.
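For reference, a minimal usage sketch (the address is a placeholder of my own, and hyperref is assumed for the clickable link; only the \urltilde definition comes from the answer above):
\usepackage{hyperref}
% ... the \catcode/\urltilde definitions from above go in the preamble ...
\url{http://www.example.com/\urltilde username/some_stuff/}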

The url package didn't work for me; hyperref does the job.
\usepackage{hyperref}
\url{http://website.com/~username/some_stuff/}
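A minimal, compilable sketch around it (the surrounding document is mine, purely for illustration):
\documentclass{article}
\usepackage{hyperref}
\begin{document}
See \url{http://website.com/~username/some_stuff/} for the files.
\end{document}
With hyperref, the tilde inside \url is treated as an ordinary character, so it should display, click, and copy as expected.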

I think it is better to use URL encoding in such a case (see, e.g., http://www.blooberry.com/indexdot/html/topics/urlencoding.htm).
It means replacing the tilde in the link with %7E.
Maybe it does not look so good in the final document (readers will see %7E instead of the tilde), but at least the copy-paste functionality works for sure, which I think is the most important thing.
For instance, for the link www.example.com/~someuser/somepage.htm I use the following code:
{\tt http://www.example.com/\%7Esomeuser/somepage.htm}
PS: The same applies for all links with white spaces or any other special characters.
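If the visible %7E bothers you, one variation (my own suggestion, not from the answer above, and assuming the hyperref package) is to keep the encoded address as the link target while displaying more readable text:
\usepackage{hyperref}
\href{http://www.example.com/\%7Esomeuser/somepage.htm}{\tt www.example.com/\textasciitilde someuser/somepage.htm}
Clicking then follows the encoded target; copy-pasting the visible text, of course, copies whatever tilde glyph the font provides.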

I think $_{\widetilde{~}}$ works well for the tilde issue.

I want to suggest using %7e:
{\tt http://example.com/\%7etest}
The \tt is for making it monospace.
It looks a bit different, but it allows copy-and-paste.

\symbol{126} would be another way, but in the default font it also yields a superscripted tilde. An ugly hack (but what isn't in LaTeX) would be to use
${}_{\textrm{\symbol{126}}}$
which produces a text tilde in math mode and subscripts it, so it appears in the middle of the line. It seems to work for a clickable link as well. You can always put that into a command of its own :)
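For instance, a sketch of wrapping it in a command (the name \lowtilde is my own, purely illustrative):
\newcommand{\lowtilde}{${}_{\textrm{\symbol{126}}}$}
% usage in running text:
Visit http://example.com/\lowtilde username for the files.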

Admittedly, I'm not a LaTeX user, but does this page help?
http://www.cse.wustl.edu/~mgeorg/html/tildalatex.html
They do the following:
\def\urltilda{\kern -.15em\lower .7ex\hbox{\~{}}\kern .04em}
\def\urldot{\kern -.10em.\kern -.10em}
\def\urlhttp{http\kern -.10em\lower -.1ex\hbox{:}\kern -.12em\lower 0ex\hbox{/}\kern -.18em\lower 0ex\hbox{/}}
The way this is used is
{\tt mgeorg#cse\urldot wustl\urldot edu}
{\tt \urlhttp www\urldot cse\urldot wustl\urldot edu/\urltilda mgeorg}

After wasting a lot of time on a related problem with LaTeXing a tilde, I thought I should record my results here in case it is a help to anyone else.
tldr: To avoid some of the difficulties of typesetting a proper tilde ~ character, I recommend adding
\usepackage[T1]{fontenc}
in the preamble of your latex file to get more modern versions of font encoding. On modern TeX installations, this automatically loads the T1-encoded version of Knuth's Computer Modern fonts.
Details:
As previous posters have noted, with standard pdflatex/latex and the default fonts (the original OT1-encoded Computer Modern), there are difficulties in rendering a regular tilde character, ~, and there are various workarounds with varying degrees of satisfaction. One workaround is to use the \textasciitilde command. For example, you can use it to put a tilde in a URL, like:
https://w3.pppl.gov/\textasciitilde{}hammett
This gives a raised tilde (the kind of tilde intended for accents over another character); methods to lower it to a more normal tilde are discussed by other posters. This works as expected if the resulting PDF is viewed in Adobe Acrobat. A downside is that if the PDF is viewed with the built-in Preview app on a Mac (or in TeXShop's previewer, which uses the same Mac libraries for rendering), the URL link visually looks correct as https://w3.pppl.gov/~hammett, but if you actually click on it, it gets interpreted only as
https://w3.pppl.gov/
If you try to cut and paste the whole URL from Preview into a browser, the tilde in the pdf is replaced with a blank
https://w3.pppl.gov/%20hammett
so it doesn't work. (Pasting into some browsers adds other special characters as well.) (The \url{...} command will display a URL with tildes correctly, but it forces a fixed-width typewriter font, and there are times when you want a tilde somewhere other than in a URL.)
Investigating further, I learned that the character that latex puts into the pdf file is not a regular tilde character but a "Combining Tilde"
COMBINING TILDE Unicode: U+0303, UTF-8: CC 83
(previously known as a "non-spacing tilde", indicating its usage for accents over another character) and that is what Mac's Preview renders. This means that cut-and-paste from Mac's Preview into a browser fails.
However, Adobe Acrobat apparently implements a workaround: when displaying the PDF it converts this into a regular "Tilde"
TILDE Unicode: U+007E, UTF-8: 7E
so cut-and-paste of a URL with a tilde from an Adobe-displayed pdf file into a web browser works fine.
(I determined these unicode values by copying the rendered tilde character from the TeXShop Preview window and pasting it into the Character Viewer app in the menu bar on a Mac. Then right-click on the character to select "Copy Character Info". The Character Viewer app can be turned on in Mac Preferences, Keyboard, and select "Show keyboard and emoji viewers in menu bar".)
This problem goes away if you use T1-encoded fonts by adding
\usepackage[T1]{fontenc}
to your latex file preamble. Furthermore, if you use the Adobe Times-like font
\usepackage{newtxtext}
then tilde is rendered at mid height automatically and doesn't need to be lowered. (The above command uses the newtx font only for regular text, while leaving the math fonts unchanged. I like newtxtext because it is slightly darker than CM. But for math I prefer to keep Knuth's traditional Computer Modern CM font, because the newtxmath font, like other Times math fonts, renders math italic v in way that is confusingly like a Greek nu.)
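Putting this together, a minimal preamble sketch (hyperref is my addition for clickable links; the other packages are the ones named above):
\documentclass{article}
\usepackage[T1]{fontenc} % modern font encoding: real tilde glyph, sane copy-paste
\usepackage{newtxtext}   % Times-like text font; tilde sits at mid height
\usepackage{hyperref}
\begin{document}
See https://w3.pppl.gov/\textasciitilde{}hammett or \url{https://w3.pppl.gov/~hammett}.
\end{document}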
P.S.: For more info on why it's always good to use the fontenc package for a more modern approach than the 1970's original encoding, see (1) https://www.texfaq.org/FAQ-why-inp-font, (2) https://tex.stackexchange.com/questions/664/why-should-i-use-usepackaget1fontenc, and (3) https://en.wikibooks.org/wiki/LaTeX/Fonts.

Related

Gibberish table output in tabula-java for Japanese PDF but works in standalone Tabula

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no error/warning messages. It does seem that the content of the PDF is processed, though.
When using the standalone Tabula tool, the characters are encoded properly:
I searched the tabula-py and tabula-java documentation online; below are the suggestions I could find, but they don't change the output:
Setting the -Dfile.encoding=utf8 (in java call to tabula-py or tabula-java)
Setting chcp 65001 (in Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
The text, as in any PDF, is written in the author's arbitrary order; for example, the first body line of the PDF (港区内認可保育園等一覧) is the 1262nd block of text, added long after the table was started. To hear the written order we can use Read Aloud, which also verifies character and language recognition, but unless the PDF was correctly tagged it will also jump from text block to text block.
So internally the text is rarely tabular; the first 8 lines are:
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need text extractors that work in a grid-like manner or convert the text layout into row-by-row output.
This is where all extractors will be confounded as to how to output such a jumbled, dense layout; generally all of them will struggle with this page.
Hence it's best to use a good generic solution. It will still need data cleaning, but at least you will have something to work on.
If you only need one zone of the page, it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good, but it could possibly be better if you use pdftotext -layout and adjust some options to produce a more regular order.
Your Question
the difference in encoding output?
The Answer
The text output you get from a PDF is not its internal encoding. The desired text output is UTF-8, but a PDF does not store text as UTF-8 or Unicode; it simply uses numbers looked up through a font character map. If the map were poor, everything would be gibberish; in this case the map is good, so where does the gibberish arise? It arises because the output stage is not using UTF-8, and console output is rarely Unicode.
You correctly note that the console needs to be set to Unicode mode; then the output should match (except for the density problem).
The density issue would be easier to handle if the document were preprocessed into a flowing format such as HTML, or by using a different language.

Are there Ansi escape sequences for superscript and subscript?

I'm playing around with ANSI escape sequences, e.g.
echo -e "\e[91mHello\e[m"
on a Linux console to display colored text.
Now I try to use superscript and subscript output like a=b².
I read here and here about Partial Line Down (subscript) and Partial Line Up (superscript), but I'm not sure about the exact syntax and even which terminal client might support this.
Any suggestions about this?
Possibly some commercial product supports it, but it's not supported by any terminal emulator you'll encounter (unless someone modifies one just to prove a point).
The standard describes possible escape sequences, but there is no requirement that any given sequence is supported by any terminal. There are commonly supported (and assumed) sequences such as clearing the screen, but even for that, not all terminals have supported the feature.
The reason is that terminal emulators are generally used with applications (such as text editors) which assume a regular grid of rows/columns and that the text is shown compactly (with no extra space such as would be needed to allow for partial-line movement). Back in the day when people used typewriters, it was common to have 1.5 or 2.0 line spacing and to get no more than 33 lines on a page. That changed long ago.
The need for subscripts/superscripts didn't go away; Unicode provides a usable set of characters with that representation (see the Superscripts and Subscripts block, range U+2070–U+209F).
Further reading:
Your New Royal Portable (1953).
Line Spacing - Butterick's Practical Typography
console_codes - Linux console escape and control sequences

Tilde over n when converting from markdown to latex with pandoc

I have a markdown document that I convert to PDF via pandoc's latex engine. I'm trying to render an n with a tilde over it, as in "niño", with markdown like the following:
ni\~{n}o
...but this just gets rendered in the PDF as "ni~no" -- i.e. the tilde gets interpreted literally. I've also tried escaping the backslash (ni\\~{n}o), surrounding everything in brackets (ni{\~{n}}o), and basically what I think is every possible combination of escaping characters in this sequence, but nothing works. It also fails even when the sequence is on its own (i.e. \~{n}).
But, other similar sequences that are based on letters rather than symbols work just fine (e.g. Otter\r{a} gets rendered correctly to "Otterå"). Pandoc is specifically failing to handle the tilde (or maybe more generally non-letter-based latex character sequences -- I haven't tested others).
The command I'm using to build the pdf is pandoc file.md -o file.pdf. I've also tried specifying -f markdown+raw_tex, but it still fails (nor should I need to, since the \r{a} works without it, and I think raw_tex is enabled by default anyway).
Any thoughts? I know I can use xetex to just enter these characters directly, but that's not really a satisfying solution...
Besides using the ñ character directly (which apparently works in native Pandoc because it's magic!), an alternative is to create a simple LaTeX \newcommand for forcing native TeX interpretation.
\newcommand{\tex}[1]{#1}
ni\tex{\~n}o
Thanks to John McFarlane for introducing me to this clever workaround!
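For completeness, a sketch of how the pieces might fit together in a standalone markdown file (the header-includes metadata block is my assumption about where to put the \newcommand; it is not part of the original answer):
---
header-includes: |
  \newcommand{\tex}[1]{#1}
---
The Spanish word ni\tex{\~n}o should come out as "niño" in the PDF.
Compiled with the same command as before: pandoc file.md -o file.pdf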

Cannot Serve International Characters From Lisp Portable AllegroServe

I am using Clozure CL on Mac OS X 10.9 and Portable AllegroServe.
I have a file whose text contains characters like ı ç ş ö (some of the characters Turkish has) and some Arabic characters, and I cannot serve them. When I visit from the browser, these characters are not displayed at all; the only part of the text shown is the part up to the first ı.
In Lisp I use a function built from a do loop with read-line and format (I have also tried print, princ, and prin1) that reads the entire document, and when I set :external-format :utf-8 the characters read are shown properly in Lisp. The problem is in serving them; if I can serve them as I read them in Lisp, it will be done.
Also, if I do not set :external-format at all, the text is read improperly in Lisp, as expected; however, this time the browser can show all the text, but with wrong characters in place of the ones described above.
How do I fix this and use :external-format character encodings properly?
See http://www.xach.com/lisp/allegro-cl/2001-3/964.html for an example on how to use :external-format in AllegroServe.
Cheers
Frank
P.S. I also posted an answer to the same question on the newsgroup comp.lang.lisp.

Parsing PDF files

I'm finding it difficult to parse a PDF file that's created in a non-English language. I used PDFBox and iText but couldn't find anything in there that could help parse this file. Here's the PDF file I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9, and so on. So when you copy-paste that text, you get the garbage above.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to PDF by itself using the same LaTeX/GhostScript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
