Cannot Serve International Characters From Lisp Portable AllegroServe - localization

I am using Clozure CL on Mac OS X 10.9 and Portable AllegroServe.
I have a file whose text contains characters like ı ç ş ö (some of the characters Turkish has) and some Arabic characters, and I cannot serve them. When I visit the page from the browser, these characters are not displayed at all; only the part of the text up to the first ı is shown.
In Lisp I use a function built from a do loop, read-lines, and format (I have also tried print, princ, and prin1) that reads the entire document, and when I set :external-format to :utf-8 the characters it reads are shown properly in Lisp. The problem is in serving them; if I can serve them the way I read them in Lisp, the job is done.
Also, if I do not set :external-format at all, the text is read improperly in Lisp, as expected; however, this time the browser shows all the text, but with wrong characters in place of the ones described above.
How can I fix this and use external formats (character encodings) properly?

See http://www.xach.com/lisp/allegro-cl/2001-3/964.html for an example of how to use :external-format in AllegroServe.
Cheers
Frank
P.S. I also posted an answer to the same question on the newsgroup comp.lang.lisp.

Related

RStudio - Script opened with wrong encoding, how to get back original characters?

I have a file in Spanish; when viewed on my teacher's PC, a bit of the text would display as
regresión cuantílica más
but now that I've opened it on mine I see this:
regresiÃ³n cuantÃ­lica mÃ¡s
I have tried "Save with Encoding" to ISO-8859-1 and UTF-8 but it doesn't seem to change anything. Will I need to run some regex replacements on my file or is there a simpler way to fix this?
If you have already saved it and you've lost the original version of the file, it will be a pain to recover.
What you should have done when you noticed the bad characters was to use "Reopen with Encoding" and choose UTF-8. If you can still get the original file, do this now.
If you can't, then you're stuck with lots of manual fixing. Accented characters (and Euro signs, and a few other things) will show up as multi-character sequences. When you recognize one, use search and replace to replace that sequence with the correct character.
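If the garbling followed the usual pattern here (a UTF-8 file decoded as ISO-8859-1 or Windows-1252), the repair can also be done mechanically rather than sequence by sequence. Below is a minimal sketch in Java rather than R, purely to show the mechanics; the garbled string is a reconstruction of the example above, and ISO-8859-1 is an assumption about which wrong encoding was used (swap in windows-1252 if that was it).
import java.nio.charset.StandardCharsets;
public class MojibakeRepair {
    public static void main(String[] args) {
        // The garbled text as it appears after the wrong decode (written with
        // Unicode escapes so the example survives copy-paste).
        String garbled = "regresi\u00C3\u00B3n cuant\u00C3\u00ADlica m\u00C3\u00A1s";
        // Turn the characters back into the bytes they came from, then decode
        // those bytes as UTF-8, which is what they were all along.
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        String repaired = new String(originalBytes, StandardCharsets.UTF_8);
        System.out.println(repaired); // prints: regresión cuantílica más
    }
}
The same round trip (encode the garbled text with the wrong charset, then decode the resulting bytes as UTF-8) works in any language that lets you name charsets explicitly.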

How to read a text file in ancient encoding?

There is a public project called Moby containing several word lists. Some files contain symbols from European alphabets and were created in pre-Unicode times. The readme, dated 1993, reads:
"Foreign words commonly used in English usually include their
diacritical marks, for example, the acute accent e is denoted by ASCII
142."
Wikipedia says that the last ASCII symbol has number 127.
For example this file: http://www.gutenberg.org/files/3203/files/mobypos.txt contains symbols that I couldn't read in any of various Latin encodings. (There are plenty of such symbols at the very end of the section of words beginning with B, just before the letter C.)
Could someone please advise what encoding should be used to read this file, or how it can be converted to some readable modern encoding?
A little research suggests that the encoding for this page is Mac OS Roman, which has é at position 142. Viewing the page you linked and changing the encoding (in Chrome, View → Encoding → Western (Macintosh)) seems to display all the words correctly (it is incorrectly reporting ISO-8859-1).
How you deal with this depends on the language / tools you are using. Here’s an example of how you could convert into UTF-8 with Ruby:
require 'open-uri'
s = open('http://www.gutenberg.org/files/3203/files/mobypos.txt').read
s.force_encoding('macroman')
s.encode!('utf-8')
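If you are on the JVM instead, a roughly equivalent sketch in Java looks like this; it assumes a downloaded copy of the file (the file names are just examples) and that the JDK's extended charsets, which include x-MacRoman, are available, as they are in standard JDK builds.
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
public class MobyConvert {
    public static void main(String[] args) throws Exception {
        // Read the raw bytes of a downloaded copy of mobypos.txt.
        byte[] raw = Files.readAllBytes(Path.of("mobypos.txt"));
        // Decode them as Mac OS Roman (so byte 142 becomes é)...
        String text = new String(raw, Charset.forName("x-MacRoman"));
        // ...and write the text back out as UTF-8.
        Files.writeString(Path.of("mobypos-utf8.txt"), text, StandardCharsets.UTF_8);
    }
}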
You are right in that ASCII only goes up to position 127 (it’s a 7-bit encoding), but there are a large number of 8 bit encodings that are supersets of ASCII and people sometimes refer to those as “Extended ASCII”. It appears that whoever wrote the readme you refer to didn’t know about the variety of encodings and thought the one he happened to be using at the time was universal.
There isn’t a general solution to problems like this, as there is no guaranteed way to determine the encoding of some text from the text itself. In this case I just used Wikipedia to look through a few until I found one that matched. Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a good place to start reading about character sets and encodings if you want to learn more.

character encoding output to a file in linux

The working environment is jboss+mssql
I am running a query and writing the formatted result to a text file. The query result has some French accented characters.
On my local machine everything works fine, but on the UAT server (a Linux box, UTF-8), the French accented characters become question marks.
Does anyone know how to solve it?
It depends on how you create your file - a code example would be helpful.
If you do specify an encoding explicitly, e.g. when creating a Writer, then if it doesn't match the locale of the machine on which you view the file, you may see question marks, placeholder boxes etc. instead of accented letters. You can use the locale command to check your locale and this will make it possible to learn the associated character encoding. This is just a matter of viewing the file. You say that the box is UTF-8, but do ensure that the app is also running under a UTF-8 locale - your user console and the server app may be using different locales.
If you do not specify the character encoding when writing, most often you will end up using the system's locale. In that case it may happen that this locale doesn't support the characters you need, so they are replaced with placeholders. A solution would be to change the locale with which your app is running e.g. by exporting the corresponding LC_* environmental variables.
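Either way, the most robust fix is to name the encoding explicitly when the file is written, rather than inheriting whatever locale the server process happens to run under. Here is a minimal sketch in plain Java; the file name and content are placeholders, not your actual code.
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
public class ExportReport {
    public static void main(String[] args) throws Exception {
        // Name the charset explicitly; the platform default is derived from the
        // locale, and under a POSIX/ASCII locale unmappable letters such as é
        // are written out as '?'.
        try (Writer out = Files.newBufferedWriter(Path.of("export.txt"), StandardCharsets.UTF_8)) {
            out.write("Résultat de la requête : état, qualité, déjà\n");
        }
    }
}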
So, the short checklist goes like this:
How do you write your file? Is the encoding specified explicitly?
What is the locale with which the app is running (output of locale command)?
Check the actual bytes written to your file using the od -t x1 command or a hex viewer like the one included in mc. Are the question marks actual question marks (hex code 3F), or some other character? If they take one byte each, they're probably in one of the Latin-N (ISO 8859-N) encodings. If they take more than one byte, it's probably UTF-8 (I understand the letters a-z look normal, so it's not UTF-16).

How to correctly display Japanese RTF Fonts

I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.
Typically, English and Japanese is mixed and is mostly displayed without a problem, for example:
Inventory turns partnerships. 在庫回転率の
(my apologies if the Japanese text is broken incorrectly - I do not speak or read the language).
Quite frequently however, only the Japanese portion of the text will be gibberish, for example:
ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優 先順位と彼らに販売する知識)
From extensive online searching, it appears that the problem is a result of the fonts saved as part of the RTF. The fonts present on a Japanese-language version of Windows are not necessarily the same as on a US English version. It is possible to programmatically replace the fonts in the RTF file, which yields an almost acceptable result, i.e.
-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB
However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:
-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?
Clearly, the Unicode characters are rendered correctly, but for example the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.
Is there a generic, (relatively) foolproof way to take RTF containing Eastern Languages and reliably displaying it again?
For completeness sake, I updated the RTF font table in the following way:
Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
Replacing font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Update: Updating the font names alone won't make a difference. The locale seems to be the big problem. I have seen a few sites discussing ways of converting the display of Japanese RTF to something most readers would handle, but I haven't found a solution yet; see for example:
here and here.
My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether RTF you import is encoded Shift-JIS or Unicode, and also whether the machine you are running on (and therefore D2009 default input format) is Japanese or not. In Japan, if a text file has no Unicode BOM it would usually be Shift-JIS (but not always).
I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.
What I did was to run the entire RTF text through a conversion routine that swapped every character above ASCII 127 for its Unicode decimal code point equivalent. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file with no problem. Ironically, it re-encoded the \uc1\uxxx? back to their RTF-escaped equivalents.
Private Function ConvertRtfToUnicode(ByVal value As String) As String
    Dim ch As Char() = value.ToCharArray()
    Dim c As Char
    Dim sb As New System.Text.StringBuilder()
    Dim code As Integer
    For i As Integer = 0 To ch.Length - 1
        c = ch(i)
        code = Microsoft.VisualBasic.AscW(c)
        If code <= 127 Then
            'Don't need to replace if one of your typical ASCII codes
            sb.Append(c)
        Else
            'MR: Basic idea came from here http://www.eggheadcafe.com/conversation.aspx?messageid=33935981&threadid=33935972
            'Swaps the character for its Unicode decimal code point equivalent
            sb.Append(String.Format("\uc1\u{0:d}?", code))
        End If
    Next
    Return sb.ToString()
End Function
Not sure if that will help your problem, but it's working for me.

What is the best way to produce a tilde in LaTeX for a website?

Following the previous questions on this topic: when you produce a website in LaTeX, what is the best way to produce a URL that contains a tilde? \verb produces the raised tilde, which does not read well, and $\sim$ does not copy/paste well (it adds a space when I do it). Solutions?
It seems like this should be one of those things that has a very easy fix... if it doesn't, why not?
I'd look at the url package.
I know this is an old question, but I recently came up with something that, despite a severe lack of elegance, works beautifully.
\catcode`~=11 % make LaTeX treat tilde (~) like a normal character
\newcommand{\urltilde}{\kern -.15em\lower .7ex\hbox{~}\kern .04em}
\catcode`~=13 % revert back to treating tilde (~) as an active character
Now you can use \urltilde inside of a \url tag (even in a .bib file) and:
1) the URL will render perfectly;
2) clicking on the URL will take you to the correct address; and,
3) copy-paste will put the correct address in the clipboard.
This is the only solution I have found that satisfies all three of these requirements. I hope it helps somebody out there.
The url package didn't work for me; hyperref does the job.
\usepackage{hyperref}
\url{http://website.com/~username/some_stuff/}
I think it is better to use URL encoding in such a case (see, e.g., http://www.blooberry.com/indexdot/html/topics/urlencoding.htm).
It means replacing the tilde in the link with %7E.
Maybe it does not look so good in the final document (readers will see %7E instead of the tilde), but at least the copy-paste functionality works for sure, which I think is the most important thing.
For instance, for the link www.example.com/~someuser/somepage.htm I use the following code:
{\tt http://www.example.com/\%7Esomeuser/somepage.htm}
PS: The same applies for all links with white spaces or any other special characters.
I think $_{\widetilde{~}}$ works well for the tilde issue.
I want to suggest using %7e
\tt{http://example.com/\%7etest}
tt is for making it monospace.
It looks a bit different, but it allows copy-and-paste.
\symbol{126} would be another way, but in the default font it also yields a superscripted tilde. An ugly hack (but what isn't in LaTeX) would be to use
${}_{\textrm{\symbol{126}}}$
which produces a text tilde in Math mode and subscripts it. So it appears in the middle of the line. Seems to work for a clickable link as well. You can always put that into a command on its own :)
Admittedly I'm not a LaTeX user, but does this page help?
http://www.cse.wustl.edu/~mgeorg/html/tildalatex.html
They do the following:
\def\urltilda{\kern -.15em\lower .7ex\hbox{\~{}}\kern .04em}
\def\urldot{\kern -.10em.\kern -.10em}
\def\urlhttp{http\kern -.10em\lower -.1ex\hbox{:}\kern -.12em\lower 0ex\hbox{/}\kern -.18em\lower 0ex\hbox{/}}
The way this is used is
{\tt mgeorg#cse\urldot wustl\urldot edu}
{\tt \urlhttp www\urldot cse\urldot wustl\urldot edu/\urltilda mgeorg}
After wasting a lot of time on a related problem with LaTeXing a tilde, I thought I should record my results here in case it is a help to anyone else.
tldr: To avoid some of the difficulties of typesetting a proper tilde ~ character, I recommend adding
\usepackage[T1]{fontenc}
in the preamble of your latex file to get more modern versions of font encoding. On modern TeX installations, this automatically loads the T1-encoded version of Knuth's Computer Modern fonts.
Details:
As has been noted by previous posters, using standard pdflatex/latex with the default fonts (the original OT1-encoded Computer Modern), there are difficulties in trying to render a regular tilde character, ~, and there are various workarounds with varying degrees of satisfaction. One workaround is to use the \textasciitilde command. For example, you can use it to put a tilde in a URL, like:
https://w3.pppl.gov/\textasciitilde{}hammett
This gives a raised tilde (the kind of tilde intended for accents over another character), and methods to lower it to a more normal tilde are discussed by other posters. While this works as expected if the resulting PDF file is viewed in Adobe Acrobat, a downside is that if the PDF is viewed with the built-in Preview app on a Mac (or if viewed in TeXShop's previewer, which uses the same Mac libraries for rendering), then this URL link visually looks correct as https://w3.pppl.gov/~hammett, but if you actually click on it, it gets interpreted only as
https://w3.pppl.gov/
If you try to cut and paste the whole URL from Preview into a browser, the tilde in the pdf is replaced with a blank
https://w3.pppl.gov/%20hammett
so it doesn't work. (If you paste into some browsers, there are other special characters added also). (The \url{...} command will display a URL with tildes correctly, but forces it to use a fixed-space terminal font, and there are times when you want to use a tilde somewhere besides in a URL.)
Investigating further, I learned that the character that latex puts into the pdf file is not a regular tilde character but a "Combining Tilde"
COMBINING TILDE Unicode: U+0303, UTF-8: CC 83
(previously known as a "non-spacing tilde", indicating its usage for accents over another character) and that is what Mac's Preview renders. This means that cut-and-paste from Mac's Preview into a browser fails.
However, Adobe Acrobat somehow implemented a workaround and when displaying the pdf converts this into a regular "Tilde"
TILDE Unicode: U+007E, UTF-8: 7E
so cut-and-paste of a URL with a tilde from an Adobe-displayed pdf file into a web browser works fine.
(I determined these unicode values by copying the rendered tilde character from the TeXShop Preview window and pasting it into the Character Viewer app in the menu bar on a Mac. Then right-click on the character to select "Copy Character Info". The Character Viewer app can be turned on in Mac Preferences, Keyboard, and select "Show keyboard and emoji viewers in menu bar".)
This problem goes away if you use T1-encoded fonts by adding
\usepackage[T1]{fontenc}
to your latex file preamble. Furthermore, if you use the Adobe Times-like font
\usepackage{newtxtext}
then tilde is rendered at mid height automatically and doesn't need to be lowered. (The above command uses the newtx font only for regular text, while leaving the math fonts unchanged. I like newtxtext because it is slightly darker than CM. But for math I prefer to keep Knuth's traditional Computer Modern CM font, because the newtxmath font, like other Times math fonts, renders math italic v in a way that is confusingly like a Greek nu.)
P.S.: For more info on why it's always good to use the fontenc package for a more modern approach than the 1970's original encoding, see (1) https://www.texfaq.org/FAQ-why-inp-font, (2) https://tex.stackexchange.com/questions/664/why-should-i-use-usepackaget1fontenc, and (3) https://en.wikibooks.org/wiki/LaTeX/Fonts.
