How can I output whatever æ would be, if ø = ø?
I'm guessing the left side is unicode and the right side is something else, for example iso-8859-1, but how can I print out what a unicode character would be when messed up?
Backstory: I have a bit of a strange problem here with Steam messing up character encodings. Trying to help a friend recover their account and I think they have used the letter æ in their secret answer. The dialog for resetting the password doesn't accept that letter, and it says the answer is wrong if we try natural alternatives. In the recovery email I get, the letter ø shows up as ø in the secret question. So, I'm thinking perhaps when the answer and question was created, the letter æ was accepted, but messed up. Figured I could try to use the messed up equivalent, but don't know what that would be, and my programming skills fails me in finding it myself :p
In Python, you can encode the string to a byte-string in UTF-8, and then convert the byte-string to a (text) string using iso-8859-1. The result will be the desired mojibake.
In Python 3:
>>> 'æ'
'æ'
>>> 'æ'.encode('utf8')
b'\xc3\xa6'
>>> 'æ'.encode('utf8').decode('iso-8859-1')
'æ'
In Python 2, use u'æ' instead of 'æ'.
Related
I have a string that, by using string.format("%02X", char), I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
local convString = ""
for char in input:gmatch("(..)") do
convString = convString..(string.char("0x"..char))
end
return convString
end
It appeared to work, but didnt think about characters above 127. Rookie mistake. Now I'm unsure how I can get the additional characters up to 256 display their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
print(input)
end
I did a few gsub strings to substitute in other characters and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, it got all forgotten.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
return (input:gsub( "..", function(c)
return string.char( tonumber( c, 16 ) )
end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.
I just decompiled a file with luadec, it does it well, and, the output not being perfect, it's still usable, but I'm getting a weird string of numbers \198\247\184\181\188\177\177\219\183\161\189\186 that I know for a fact are in Korean language, but I do not know what they're called and basically can't find anything about them.
I just need to correctly translate the string from numbers to symbols or gibberish text, like this c±Ý»ö´À³¦Ç¥.
If someone could point me in the right direction I would be grateful, thanks.
I ran this script with Lua
print"\198\247\184\181\188\177\177\219\183\161\189\186"
and saved the output to a text file which I then loaded into Safari.
I got gibberish the default encoding. I got 포링선글래스 with Korean (Mac OS) encoding. Same thing with Korean (Windows, DOS), but not with Korean (ISO 2022-KR).
Note that escaped numbers in Lua are in decimal.
I am working on data imported from legacy database into sqlite for development, legacy database has a lot of url encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (that I noticed yet, there may be more as I didn't do any testing yet), the letter "ó". For instance, using unescapeHTML on string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
Question now is either my string is properly escaped, as other Polish characters are using sequences of "&#nnn;" like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow standard for numeric character referencing. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C" which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash not a letter with code point lower than U+0256. Can it be the reason? I tried to research on this but haven't found any explanation. I also updated my Ruby to 2.1.0 as CGI::Util has changed in new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1) but '\xF3' is not a valid UTF-8 string, that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-1 and what should be UTF-8. You might have a mix of both ISO-8859-2 (or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference as the encodings overlap and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
Send everything through through your CGI::unescape_html(CGI::unescape(...)) converter.
Wrap that in an exception handler to trap the inevitable problems.
Stash the problem strings off to the side somewhere.
Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they makes sense.
Repeat with other common encodings until there's nothing left that you care about.
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
s = s.gsub(/%D3|%F3/, {'%D3' =>'%26%23xD3%3B', '%F3' => '%26%23xF3%3B'})
end
With this approach I don't have to handle invalid byte sequence exception as properly escaped string is used in CGI::unescapeHTML
I've had this problem for a long time but I've been implementing this ugly hack on the backend to get around it.
Now I've decided to act as a real developer and deal with it.
My problem is that when parsing an XML feed with any of the Norwegian characters æ, ø or å in the title node, all the letters appearing before these special characters are ommitted.
So if the word is "Bålhuset" it only displays "ålhuset" - the funny thing is that æ,ø and å characters AFTER the initial problem character is included.
So if I put for example "ÅBålhuset", I will get "Bålhuset". So it seems it's only the first occurence of one of these special characters that will cause a problem.
Any help would be immensely appreciated!
-Chris
Try while you creating XML use CDATA tags like
<title><![CDATA[Transport "Bålhuset"Classic World's]]></title>
Also here is a list of HTML Tags and more cases XML with those characters is invalid, unless they are contained within a CDATA. Also try this Question hope with help you
Otherwise you need to use their special character code. If you want to represent ö you need to type ö please review like.
And Final XML with those characters is invalid, unless they are contained within a CDATA.
You can Validate you XML while creating and easily fix the bug.
What did it for me was getting the data in JSON and using the native JSON methods; no dropped characters and other sporadic behaviour.
So what that means to me is that there is an issue with NSXMLParser that makes it choke on international characters (the first occurence of which mind you) even though everything is in order with encoding etc.
I have a file. I don't know how it was processed. It's probably a double encoding. I've found this link about double encoding that solved almost my problem:
http://www.spamusers.com/encoding.htm
It has all the double encodings substitutions to do like:
À àÁ
 Â
Unfortnately I still others weird characters like:
ú
ç
ö
Do you have an idea on how to clean these weird characters? For the ones I know I've just made a bash script and I've just replaced them. But I don't know how to recognize the others. I'm running on linux so if you have some magic commands I would like that.
The “double encodings substitutions” page that you link to seems to contain mappings meant to fix character data that has been doubly UTF-8 encoded. Thus, the proper fixing routine would be to reverse such mappings and see if the result makes sense.
For example, if you take A with grave accent, À, U+00C0, and UTF-8 encode it, you get the bytes C3 A0. If these are then mistakenly understood as single-byte encodings according to windows-1252 for example, you get the characters U+00C3 U+00A0 (letter à and no-break space). If these are then UTF-8 encoded, you get C3 83 for the former and C2 80 for the latter. If these bytes in turn are interpreted according to windows-1252, you get À as on the page.
But you don’t actually have “À”, do you? You have some digital data, bytes, which display that way if interpreted according to windows-1252. But that would be a wrong interpretation.
You should first read the data as UTF-8 encoded, decode it to characters, checking that all codes are less than 100 hexadecimal (if not, there’s yet another error involved somewhere), then UTF-9 decode again.