I've got a CSV file, which has a character encoding which I can't identify. From it's content (German language entries) I could find the following characters matching some 1-byte character encodings:
0x81 = ü
0x94 = ö
0x9A = Ü
Which Codepage is this? Is there any website where you can maybe lookup code pages by known entries?
I was assuming this could be WINDOWS-1252 or ISO-8859-1, but it's neither of them.
As I found out by some more trial and error the encoding is "CP 437" or also called "DOS". Weird to see such an encoding used nowadays.
I'm wanting to convert a hex string to ASCII character, (for the game ROBLOX).
Here's the page for the ASCII icon:
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that icon.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all due to their 'chat filtering'.
Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Output the Unicode code directly with print"\u{25BA}".
Output the UTF-8 encoding directly with print"\xE2\x96\xBA".
Output the UTF-8 encoding directly with print"\226\150\186".
In Ruby 1.9.3-429, I am trying to parse plain text files with various encodings that will ultimately be converted to UTF-8 strings. Non-ascii characters work fine with a file encoded as UTF-8, but problems come up with non-UTF-8 files.
Simplified example:
File.open(file) do |io|
io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
line, char = "", nil
until io.eof? || char == ?\n || char == ?\r
char = io.readchar
puts "Character #{char} has #{char.each_codepoint.count} codepoints"
puts "SLICE FAIL" unless char == char.slice(0,1)
line << char
end
line
end
Both files are just a single string áÁð encoded appropriately. I have checked that the files have been encoded correctly via $ file -i <file_name>
With a UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints
With an ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL
The way I am interpreting this is readchar is returning an incorrectly converted encoding which is causing slice to return incorrectly.
Is this behavior correct? Or am I specifying the file external encoding incorrectly? I would rather not rewrite this process so I am hoping I am making a mistake somewhere. There are reasons why I am parsing files this way, but I don't think those are relevant to my question. Specifying the internal and external encoding as an option in File.open yielded the same results.
This behavior is a bug. See http://bugs.ruby-lang.org/issues/8516 for details.
this snippet crashes my simulator bad.
s = "stämma"
s1 = string.sub(s,3,3)
print(s1)
It seems like it handles my character as nil, any ideas?
Joakim
I assume you are using UTF-8 encoding.
In UTF-8, a character can have a variable number of bytes, between 1 to 4. The "ä" character (228) is encoded with the two bytes 0xC3 0xA4.
The instruction string.sub(s, 3, 3) returns the third byte from the string (0xC3), and not the third character. As this byte alone is invalid UTF-8, Corona can't display the character.
See also Extract the first letter of a UTF-8 string with Lua
I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but is a Unicode en-dash (hex 2013). Your code is recieving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.