Correctly translate backslashed numbers to symbols? - lua

I just decompiled a file with luadec, it does it well, and, the output not being perfect, it's still usable, but I'm getting a weird string of numbers \198\247\184\181\188\177\177\219\183\161\189\186 that I know for a fact are in Korean language, but I do not know what they're called and basically can't find anything about them.
I just need to correctly translate the string from numbers to symbols or gibberish text, like this c±Ý»ö´À³¦Ç¥.
If someone could point me in the right direction I would be grateful, thanks.

I ran this script with Lua
print"\198\247\184\181\188\177\177\219\183\161\189\186"
and saved the output to a text file which I then loaded into Safari.
I got gibberish the default encoding. I got 포링선글래스 with Korean (Mac OS) encoding. Same thing with Korean (Windows, DOS), but not with Korean (ISO 2022-KR).
Note that escaped numbers in Lua are in decimal.

Related

RStudio - Script opened with wrong encoding, how to get back original characters?

I have a file in Spanish, when seen on my teacher's PC a bit of text would display as
regresión cuantílica más
but now that I've opened on mine I see this:
regresión cuantílica más
I have tried "Save with Encoding" to ISO-8859-1 and UTF-8 but it doesn't seem to change anything. Will I need to run some regex replacements on my file or is there a simpler way to fix this?
If you have already saved it and you've lost the original version of the file, it will be a pain to recover.
What you should have done when you noticed the bad characters was "Reopen with encoding", and chosen the "UTF-8" encoding. If you can still get the original file, do this now.
If you can't, then you're stuck with lots of manual fixing. Accented characters (and Euro signs, and a few other things) will show up as multi-character sequences. When you recognize one, use search and replace to replace that sequence with the correct character.

Cannot Serve International Characters From Lisp Portable AllegroServe

I am using Clozure Cl on Mac os x 10.9 and Portable allegro serve
I have a file with text has characters like ı ç ş ö (these are some characters Turkish also have) and some Arabic characters. I cannot serve them. when i visit from the browser this kind of characters are not displayed at all, only part of text showed is the ones until the first ı in the text.
In Lisp i use a function composed with a do and read-lines and format (or i have tried print princ prin1 also) reads entire document and when i set the :external-format :utf-8 it shows the read characters properly in Lisp. Problem is in serving them, if i can serve them as i read on Lisp it will be done.
Also If do not set :external-formatat all, in Lisp it is read improperly, as expected, however, this time the browser can show all the text but with wrong characters in place of above described characters.
How to fix that and use external-formats character encodings properly?
See http://www.xach.com/lisp/allegro-cl/2001-3/964.html for an example on how to use :external-format in AllegroServe.
Cheers
Frank
P.S. I also posted an answer to the same question newsgroup comp.lang.lisp .

lua string.upper not working with accented characters?

I'm trying to convert some French text to upper case in lua, it is not converting the accented characters. Any idea why?
test script:
print('échelle')
print(string.upper('échelle'))
print('ÉCHELLE')
print(string.lower('ÉCHELLE'))
output:
échelle
éCHELLE
ÉCHELLE
Échelle
It might be a bit overkill, but you can do this with slnunicode (which is available in LuaRocks).
require "unicode"
print(unicode.utf8.upper("échelle"))
-- ÉCHELLE
You may need to use unicode.ascii.upper or unicode.latin1.upper depending on the encoding of your source files.
You need to set a suitable locale, which depends how these strings are encoded in the source.
You seem to be using Latin 1 because of the output you gave.
In this case, trying adding the line below at the top of your script:
os.setlocale("fr_FR.ISO8859-1")
This name is for Mac OS X. For Linux, try
os.setlocale("fr_FR.iso88591")
If you're using UTF, then setting a locale won't help because string.lower converts the string one byte at a time.
Lua just uses the C library function toupper, which AFAIK doesn't support accented characters. You'd need to write a routine for that yourself.
To explain this all more effectively, Lua does not have built-in support for non-ASCII strings. You can store a Latin-1 or UTF-8-encoded string, but none of the special string manipulation functions (upper, lower, etc) will work on any non-ASCII character.
There are Lua libraries that add varying degrees of Unicode support. So you will have to use one of them.

How to display a character in the wrong characterset

How can I output whatever æ would be, if ø = ø?
I'm guessing the left side is unicode and the right side is something else, for example iso-8859-1, but how can I print out what a unicode character would be when messed up?
Backstory: I have a bit of a strange problem here with Steam messing up character encodings. Trying to help a friend recover their account and I think they have used the letter æ in their secret answer. The dialog for resetting the password doesn't accept that letter, and it says the answer is wrong if we try natural alternatives. In the recovery email I get, the letter ø shows up as ø in the secret question. So, I'm thinking perhaps when the answer and question was created, the letter æ was accepted, but messed up. Figured I could try to use the messed up equivalent, but don't know what that would be, and my programming skills fails me in finding it myself :p
In Python, you can encode the string to a byte-string in UTF-8, and then convert the byte-string to a (text) string using iso-8859-1. The result will be the desired mojibake.
In Python 3:
>>> 'æ'
'æ'
>>> 'æ'.encode('utf8')
b'\xc3\xa6'
>>> 'æ'.encode('utf8').decode('iso-8859-1')
'æ'
In Python 2, use u'æ' instead of 'æ'.

How to correctly display Japanese RTF Fonts

I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.
Typically, English and Japanese is mixed and is mostly displayed without a problem, for example:
Inventory turns partnerships. 在庫回転率の
(my apologies if the Japanese text is broken incorrectly - I do not speak or read the language).
Quite frequently however, only the Japanese portion of the text will be gibberish, for example:
ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優 先順位と彼らに販売する知識)
From extensive online searching, it appears that the problem is as a result of the fonts saved as part of the RTF. Fonts present on Japanese language version of Windows is not necessarily the same as a US English version. It is possible to programmatically replace the fonts in the RTF file which yields an almost acceptable result, i.e.
-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB
However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:
-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?
Clearly, the Unicode characters are rendered correctly, but for example the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.
Is there a generic, (relatively) foolproof way to take RTF containing Eastern Languages and reliably displaying it again?
For completeness sake, I updated the RTF font table in the following way:
Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
Replacing font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Update: Updating font names alone wont make a difference. The locale seems to be the big problem. I have seen a few site discussing ways around converting the display of Japanese RTF to something most reader would handle, but I haven't found a solution yet, see for example:
here and here.
My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether RTF you import is encoded Shift-JIS or Unicode, and also whether the machine you are running on (and therefore D2009 default input format) is Japanese or not. In Japan, if a text file has no Unicode BOM it would usually be Shift-JIS (but not always).
I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.
What I did was to run the entire RTF text through a conversion routine that swapped all characters over ascii 127 to their special unicode point equivalent. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file no problem. Ironically, it re-encoded the \uc1\uxxx? back to their RTF escaped equivalents.
Private Function ConvertRtfToUnicode(ByVal value As String) As String
Dim ch As Char() = value.ToCharArray()
Dim c As Char
Dim sb As New System.Text.StringBuilder()
Dim code As Integer
For i As Integer = 0 To ch.Length - 1
c = ch(i)
code = Microsoft.VisualBasic.AscW(c)
If code <= 127 Then
'Don't need to replace if one of your typical ASCII codes
sb.Append(c)
Else
'MR: Basic idea came from here http://www.eggheadcafe.com/conversation.aspx?messageid=33935981&threadid=33935972
' swaps the character for it's Unicode decimal code point equivalent
sb.Append(String.Format("\uc1\u{0:d}?", code))
End If
Next
Return sb.ToString()
End Function
Not sure if that will help your problem, but it's working for me.

Resources