Which codepage is 0x81 = ü, 0x94 = ö, 0x9A = Ü?

I've got a CSV file with a character encoding I can't identify. From its content (German-language entries) I could find the following characters matching some 1-byte character encodings:
0x81 = ü
0x94 = ö
0x9A = Ü
Which codepage is this? Is there a website where you can look up code pages by known entries?
I was assuming this could be WINDOWS-1252 or ISO-8859-1, but it's neither of them.

As I found out through some more trial and error, the encoding is "CP 437", also called "DOS". Weird to see such an encoding still in use nowadays.
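For anyone hunting a similar mystery encoding: a minimal Ruby sketch (my own, not from the original post) that decodes the known bytes under a few candidate single-byte codepages and reports which ones match:
# Known byte -> character pairs from the file.
bytes_to_chars = { 0x81 => "ü", 0x94 => "ö", 0x9A => "Ü" }

# Candidate single-byte codepages; extend the list as needed.
%w[CP437 CP850 Windows-1252 ISO-8859-1 ISO-8859-15].each do |name|
  matches = bytes_to_chars.all? do |byte, expected|
    # Some bytes are undefined in some codepages, hence the rescue.
    (byte.chr.force_encoding(name).encode("UTF-8") rescue nil) == expected
  end
  puts "#{name}: #{matches ? 'matches all three' : 'no match'}"
end
# CP437 and CP850 agree on this byte range, so both will match here.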

Related

RÓÍSÍN is being rendered as RÃôÃìSÃìN - which encoding is this?

I have a set of characters in a UTF-8 file like so:
RÓÍSÍN
HÉÁTHÉR
The file is being sent to another system, but the characters are being rendered like this:
RÃôÃìSÃìN ÃüNDREW
H̟̑TH̑R MULL̟N
Is it possible to tell from this information which character encoding the characters are being rendered as on the remote system?
I don't think you can tell exactly which encoding is being used, but you can tell it is an encoding that uses 1 byte per character (UTF-8 uses 1 to 4).
UTF-8 'Ó' is the two bytes 0xC3 0x93 (195 147 in decimal). Decoded one byte at a time under a single-byte "ANSI" codepage, 0xC3 comes out as 'Ã', which matches the 'Ã' prefix on every accented character in your output.
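You can reproduce this failure mode in a few lines of Ruby (my own illustration; Windows-1252 stands in here for the remote system's unknown single-byte codepage):
utf8 = "RÓÍSÍN"
raw  = utf8.b                                  # the raw UTF-8 bytes (ASCII-8BIT)
misread = raw.force_encoding("Windows-1252")   # relabel only, no conversion
puts misread.encode("UTF-8", invalid: :replace, undef: :replace)
# every accented letter comes out as a two-character pair starting with Ã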

Percent-encoding a non-extended ASCII char like extended chars

If we percent-encode the char "€", we will have %E2%82%AC as the result. OK!
My problem:
a = %61
I already know it.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
If yes, will browsers and servers understand the result as the char "a"?
If we percent-encode the char "€", we will have %E2%82%AC as the result.
€ is Unicode codepoint U+20AC EURO SIGN. The byte sequence 0xE2 0x82 0xAC is how U+20AC is encoded in UTF-8. %E2%82%AC is the URL encoding of those bytes.
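In Ruby, for instance, you can derive that result directly from the string's UTF-8 bytes:
"€".bytes.map { |b| format("%%%02X", b) }.join  # => "%E2%82%AC"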
a = %61
I already know it.
For ASCII character a, aka Unicode codepoint U+0061 LATIN SMALL LETTER A, that is correct. It is encoded as byte 0x61 in UTF-8 (and most other charsets), and thus can be encoded as %61 in URLs.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
Yes. Any character can be encoded using percent-encoding in a URL. Simply encode the character in the appropriate charset, and then percent-encode the resulting bytes. However, most ASCII non-reserved characters do not require such encoding; just use them as-is.
If yes, will browsers and servers understand the result as the char "a"?
In URLs and URL-like content encodings (like application/x-www-form-urlencoded), yes.
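A quick check of both directions with Ruby's stdlib URI helpers:
require "uri"

URI.encode_www_form_component("€")            # => "%E2%82%AC"
URI.decode_www_form_component("%61")          # => "a"
URI.decode_www_form_component("%61") == "a"   # => true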

Inconsistent IO character reading when converting encoding

In Ruby 1.9.3-429, I am trying to parse plain-text files with various encodings that will ultimately be converted to UTF-8 strings. Non-ASCII characters work fine with a file encoded as UTF-8, but problems come up with non-UTF-8 files.
Simplified example:
File.open(file) do |io|
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end
  line
end
Both files contain just the single string áÁð, encoded appropriately. I have checked that the files are encoded correctly via $ file -i <file_name>.
With a UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints
With an ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL
The way I interpret this is that readchar is returning an incorrectly converted string, which causes slice to behave incorrectly.
Is this behavior correct? Or am I specifying the file's external encoding incorrectly? I would rather not rewrite this process, so I am hoping I am making a mistake somewhere. There are reasons why I am parsing files this way, but I don't think they are relevant to my question. Specifying the internal and external encodings as options to File.open yielded the same results.
This behavior is a bug. See http://bugs.ruby-lang.org/issues/8516 for details.
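Until you are on a fixed Ruby, one possible workaround (my own sketch, not from the bug report) is to skip the per-character transcoding path entirely: read in the external encoding only, then convert the whole string in one step:
# Read one line in the file's external encoding, then convert the
# complete string to UTF-8 in a single step.
def read_line_utf8(path, charset)
  File.open(path, external_encoding: charset) do |io|
    line = io.gets
    line && line.encode(Encoding::UTF_8)
  end
end

read_line_utf8("latin1.txt", "ISO-8859-1")  # hypothetical file name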

Rails, Heroku and invalid byte sequence in UTF-8 error

I have a queue of text messages in Redis. Let's say a message in Redis is something like this:
"niño"
(spot the non-standard character).
The Rails app displays the queue of messages. When I test locally (Rails 3.2.2, Ruby 1.9.3) everything is fine, but on Heroku Cedar (Rails 3.2.2, and I believe Ruby 1.9.2) I get the infamous error: ActionView::Template::Error (invalid byte sequence in UTF-8)
After reading and rereading everything I could find online, I am still stuck on how to fix this.
Any help or pointer in the right direction is greatly appreciated!
Edit:
I managed to find a solution. I ended up using Iconv:
string = Iconv.iconv('UTF-8', 'ISO-8859-1', message)[0]
None of the suggested answers I found seemed to work in my case.
On Heroku, when your app receives the message "niño" from Redis, it is actually getting the four bytes:
0x6e 0x69 0xf1 0x6f
which, when interpreted as ISO-8859-1, correspond to the characters n, i, ñ and o.
However, your Rails app assumes these bytes should be interpreted as UTF-8, and at some point it tries to decode them that way. The third byte in this sequence, 0xf1, looks like this:
1 1 1 1 0 0 0 1
If you compare this to the table in the UTF-8 Wikipedia article, you can see this byte is the leading byte of a four-byte character (it matches the pattern 11110xxx), and as such should be followed by three more continuation bytes that all match the pattern 10xxxxxx. It isn't: the next byte is 0x6f (01101111), so this is an invalid UTF-8 byte sequence and you get the error you see.
Using:
string = message.encode('utf-8', 'iso-8859-1')
(or the Iconv equivalent) tells Ruby to read message as ISO-8859-1 encoded, and then to create the equivalent string in UTF-8 encoding, which you can then use without problems. (An alternative would be to use force_encoding to tell Ruby the correct encoding of the string, but that will likely cause problems later when you try to mix UTF-8 and ISO-8859-1 strings.)
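To make the distinction concrete, a short illustration of mine using a Latin-1 "niño":
latin1 = "ni\xF1o".force_encoding("ISO-8859-1")

# encode converts the bytes: 0xF1 becomes the UTF-8 pair 0xC3 0xB1.
utf8 = latin1.encode("UTF-8")
utf8.bytes.map { |b| format("%02x", b) }    # => ["6e", "69", "c3", "b1", "6f"]

# force_encoding merely relabels the bytes; the lone 0xF1 byte remains
# and is not valid UTF-8.
latin1.dup.force_encoding("UTF-8").valid_encoding?  # => false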
In UTF-8, the string "niño" corresponds to the bytes:
0x6e 0x69 0xc3 0xb1 0x6f
Note that the first, second and last bytes are the same. The ñ character is encoded as the two bytes 0xc3 0xb1. If you write these out in binary and compare them to the table in the Wikipedia article again, you'll see they encode the codepoint 0xf1 (U+00F1), which is also the ISO-8859-1 encoding of ñ (since the first 256 Unicode codepoints match ISO-8859-1).
If you take these five bytes and treat them as being ISO-8859-1, then they correspond to the string
niño
Looking at the ISO-8859-1 codepage, 0xc3 maps to Ã, and 0xb1 maps to ±.
So what's happening on your local machine is that your app is receiving the five bytes 0x6e 0x69 0xc3 0xb1 0x6f from Redis, which is the UTF-8 representation of "niño". On Heroku it's receiving the four bytes 0x6e 0x69 0xf1 0x6f, which is the ISO-8859-1 representation.
The real fix to your problem will be to make sure the strings being put into Redis are all already UTF-8 (or at least all the same encoding). I haven't used Redis, but from what I can tell from a brief Google search, it doesn't concern itself with string encodings but simply gives back whatever bytes it has been given. You should look at whatever process is putting the data into Redis and ensure it handles the encoding properly.
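If you can't fix the producer immediately, a defensive normalization helper is one stopgap. A minimal sketch, assuming incoming messages are either valid UTF-8 or ISO-8859-1:
# Return message as valid UTF-8, transcoding from ISO-8859-1 only when
# the bytes are not already valid UTF-8.
def to_utf8(message)
  utf8 = message.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?
  message.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

to_utf8("ni\xF1o".b)      # => "niño"  (Latin-1 bytes, as on Heroku)
to_utf8("ni\xC3\xB1o".b)  # => "niño"  (UTF-8 bytes, as locally)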

What Character Encoding Is This?

I need to clean up some files containing French text. The problem is that the files erroneously contain multiple encodings within the same file.
I think some sections are ISO-8859-1 (Latin-1), but other parts have text encoded in single-byte characters that look like 'extended' ASCII. In other words, it is UTF-7 encoding plus the following:
0x82 for é (e acute)
0x8a for è (e grave)
0x88 for ê (e circumflex)
0x85 for à (a grave)
0x87 for ç (c cedilla)
What encoding is this?
That's the original IBM PC encoding, Code page 437.
That website shows 0x87 for the cedilla. I haven't looked much further than this, but I bet the rest of your information could be found there as well.
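A quick Ruby check of all five reported bytes against CP437 (Ruby knows it as IBM437, with CP437 as an alias):
expected = { 0x82 => "é", 0x8a => "è", 0x88 => "ê", 0x85 => "à", 0x87 => "ç" }

expected.each do |byte, char|
  decoded = byte.chr.force_encoding("CP437").encode("UTF-8")
  puts format("0x%02x -> %s %s", byte, decoded, decoded == char ? "(matches)" : "(differs)")
end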
