RÓÍSÍN is being rendered as RÃôÃìSÃìN - which encoding is this?

I have a set of characters in a UTF-8 file like so:
RÓÍSÍN
HÉÁTHÉR
The file is being sent to another system, but the characters are being rendered like this:
RÃôÃìSÃìN ÃüNDREW
H̟̑TH̑R MULL̟N
Is it possible to tell from this information which character encoding the characters are being rendered as on the remote system?

I don't think you can tell exactly which encoding is being used, but you can tell it is an encoding that uses 1 byte per character (UTF-8 uses 1 to 4 bytes per character).
UTF-8 'Ó' is the two bytes 0xC3 0x93, which is 195 147 in decimal. A single-byte "ANSI" codepage renders those two bytes as two separate characters, such as 'Ãô', which matches your output.
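As a quick check in Ruby, here is a minimal sketch of that diagnosis; Windows-1252 is used only as an example single-byte codepage, since the remote system's actual codepage is unknown:
# Dump the UTF-8 bytes of the original character:
"Ó".bytes.map { |b| format('0x%02X (%d)', b, b) }
# => ["0xC3 (195)", "0x93 (147)"]
# Reinterpret the UTF-8 bytes of the whole name in a single-byte codepage:
"RÓÍSÍN".dup.force_encoding('Windows-1252').encode('UTF-8', invalid: :replace, undef: :replace)
# Each accented letter comes out as two characters, matching the doubled-up output in the question.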

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API is "manhã" but the API returns "manh�". Other chars like á or ç are being returned correctly, so I guess this is some bug in the API.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You're on the right track: you should be able to solve the problem with the encode method. When the source encoding is known, you can simply use:
response.encode('UTF-8', 'ISO-8859-1')
There may be times when there are invalid characters in the source encoding; to avoid exceptions, you can instruct Ruby how to handle them:
# This will transcode the string to UTF-8 and replace any invalid/undefined characters with '' (empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
This is all laid out in the Ruby docs for String - check them out!
---
Note, many people incorrectly assume that force_encoding will somehow fix encoding problems. force_encoding simply tags the string as the specified encoding; it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comment section, you can use force_encoding followed by encode to transcode your string: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first example using encode above).
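To make the distinction concrete, here is a minimal sketch; the byte string is built by hand and simply stands in for the raw API response:
# Hypothetical ISO-8859-1 bytes for "manhã" (0xE3 is "ã" in that charset):
bytes = "manh\xE3".force_encoding('ASCII-8BIT')
bytes.dup.force_encoding('UTF-8').valid_encoding?  # => false: re-tagging does not transcode
bytes.encode('UTF-8', 'ISO-8859-1')                # => "manhã": transcoding converts the bytes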

What scheme is used to encode unicode characters in a .url shortcut?

What scheme is used to encode Unicode characters in a Windows .url shortcut?
For example, a new shortcut for the URL "http://Ψαℕ℧▶" produces a .url file with the text:
[{000214A0-0000-0000-C000-000000000046}]
Prop3=19,2
[InternetShortcut]
IDList=
URL=http://?aN??/
[InternetShortcut.A]
URL=http://?aN??/
[InternetShortcut.W]
URL=http://+A6gDsSEVIScltg-/
What is the algorithm to decode "+A6gDsSEVIScltg-" to "Ψαℕ℧▶"?
I am not asking for API code, but I would like to know the encoding scheme details.
Note: The encoding scheme is not UTF-8, UTF-16, or UCS-2, and it is not %-encoding.
+A6gDsSEVIScltg- is the UTF-7 encoded form of Ψαℕ℧▶.
The correct way to process a .url file is to use the IUniformResourceLocator and IPropertyStorage interfaces from the CLSID_InternetShortcut COM object. See Internet Shortcuts on MSDN for details.
The answer (UTF-7) allowed me to successfully develop the URL conversion routine.
Let me summarize the steps to obtain the Unicode URL from the [InternetShortcut.W] entry of a .url file (a decoding sketch follows the list):
- Pass ASCII characters through until CRLF, after making them internet-safe.
- A non-escaped "+" character starts a UTF-7 formatted Unicode sequence:
  - Collect 6-bit groups from the Base64-coded ASCII characters.
  - For every 16 bits collected, convert them to UTF-8 (1, 2, or 3 bytes).
  - Emit the generated UTF-8 bytes as %hh escapes.
  - Continue until a "-" character occurs.
  - The bit collector should then be empty.
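Here is a minimal Ruby sketch of the core "+...-" decoding step. It produces the decoded characters directly rather than %hh escapes, and it assumes the simple "+base64-" form used in the example (no "+-" escapes for a literal "+"); decode_utf7 is a hypothetical helper, not a library method:
require 'base64'

def decode_utf7(text)
  text.gsub(/\+([A-Za-z0-9+\/]*)-/) do
    b64 = Regexp.last_match(1)
    b64 += '=' * ((4 - b64.length % 4) % 4)   # restore the Base64 padding UTF-7 omits
    Base64.decode64(b64)                      # 6-bit groups -> raw bytes
          .force_encoding('UTF-16BE')         # every 16 bits is one UTF-16 code unit
          .encode('UTF-8')                    # re-encode as UTF-8
  end
end

decode_utf7('http://+A6gDsSEVIScltg-/')  # => "http://Ψαℕ℧▶/"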

Percent-encoding a non-extended ASCII char like extended chars

If we percent encode the char "€", we will have %E2%82%AC as result. Ok!
My problem:
a = %61
I already know it.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
If yes, will browsers and servers understand the result as the char "a"?
If we percent encode the char "€", we will have %E2%82%AC as result.
€ is Unicode codepoint U+20AC EURO SIGN. The byte sequence 0xE2 0x82 0xAC is how U+20AC is encoded in UTF-8. %E2%82%AC is the URL encoding of those bytes.
a = %61
I already know it.
For ASCII character a, aka Unicode codepoint U+0061 LATIN SMALL LETTER A, that is correct. It is encoded as byte 0x61 in UTF-8 (and most other charsets), and thus can be encoded as %61 in URLs.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
Yes. Any character can be encoded using percent encoding in a URL. Simply encode the character in the appropriate charset, and then percent-encode the resulting bytes. However, most unreserved ASCII characters do not require such encoding; you can just use them as-is.
If yes, will browsers and servers understand the result as the char "a"?
In URLs and URL-like content encodings (like application/x-www-form-urlencoded), yes.
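As a quick sketch of the byte-level rule described above (percent_encode is a hypothetical helper, not a standard library method):
require 'cgi'

# Percent-encode by taking the UTF-8 bytes of a string and writing each one as %HH.
def percent_encode(str)
  str.encode('UTF-8').bytes.map { |b| format('%%%02X', b) }.join
end

percent_encode('€')  # => "%E2%82%AC"
percent_encode('a')  # => "%61"

# Decoders treat both forms the same way:
CGI.unescape('%61')        # => "a"
CGI.unescape('%E2%82%AC')  # => "€"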

Inconsistent IO character reading when converting encoding

In Ruby 1.9.3-429, I am trying to parse plain text files with various encodings that will ultimately be converted to UTF-8 strings. Non-ascii characters work fine with a file encoded as UTF-8, but problems come up with non-UTF-8 files.
Simplified example:
File.open(file) do |io|
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end
  line
end
Both files are just a single string áÁð encoded appropriately. I have checked that the files have been encoded correctly via $ file -i <file_name>
With a UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints
With an ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL
The way I am interpreting this is that readchar is returning a character with an incorrectly converted encoding, which is causing slice to return an incorrect result.
Is this behavior correct? Or am I specifying the file external encoding incorrectly? I would rather not rewrite this process so I am hoping I am making a mistake somewhere. There are reasons why I am parsing files this way, but I don't think those are relevant to my question. Specifying the internal and external encoding as an option in File.open yielded the same results.
This behavior is a bug. See http://bugs.ruby-lang.org/issues/8516 for details.
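Until a fixed Ruby is available, one possible workaround (not part of the original answer, just a sketch using the same file and charset variables) is to read the raw bytes and transcode explicitly, bypassing readchar's on-the-fly conversion:
File.open(file, 'rb') do |io|
  text = io.read.force_encoding(charset.upcase).encode(Encoding::UTF_8)
  text.each_char do |char|
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
  end
end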

Rails, Heroku and invalid byte sequence in UTF-8 error

I have a queue of text messages in Redis. Let's say a message in redis is something like this:
"niño"
(spot the non-standard character).
The Rails app displays the queue of messages. When I test locally (Rails 3.2.2, Ruby 1.9.3) everything is fine, but on Heroku cedar (Rails 3.2.2; I believe it runs Ruby 1.9.2) I get the infamous error: ActionView::Template::Error (invalid byte sequence in UTF-8)
After reading and rereading all I could find online I am still stuck as to how to fix this.
Any help or pointer in the right direction is greatly appreciated!
edit:
I managed to find a solution. I ended up using Iconv:
string = Iconv.iconv('UTF-8', 'ISO-8859-1', message)[0]
None of the suggested answers I found elsewhere seemed to work in my case.
On Heroku, when your app receives the message "niño" from Redis, it is actually getting the four bytes:
0x6e 0x69 0xf1 0x6f
which, when interpreted as ISO-8859-1, correspond to the characters n, i, ñ and o.
However, your Rails app assumes that these bytes should be interpreted as UTF-8, and at some point it tries to decode them this way. The third byte in this sequence, 0xf1, looks like this in binary:
1 1 1 1 0 0 0 1
If you compare this to the table on the UTF-8 Wikipedia page, you can see this byte is the leading byte of a four-byte character (it matches the pattern 11110xxx), and as such should be followed by three more continuation bytes that all match the pattern 10xxxxxx. It isn't; instead the next byte is 0x6f (01101111), so this is an invalid UTF-8 byte sequence and you get the error you see.
Using:
string = message.encode('utf-8', 'iso-8859-1')
(or the Iconv equivalent) tells Ruby to read message as ISO-8859-1 encoded, and then to create the equivalent string in UTF-8 encoding, which you can then use without problems. (An alternative could be to use force_encoding to tell Ruby the correct encoding of the string, but that will likely cause problems later when you try to mix UTF-8 and ISO-8859-1 strings).
In UTF-8, the string "niño" corresponds to the bytes:
0x6e 0x69 0xc3 0xb1 0x6f
Note that the first, second and last bytes are the same. The ñ character is encoded as the two bytes 0xc3 0xb1. If you write these out in binary and compare them to the table in the Wikipedia article again, you'll see they encode the codepoint 0xF1 (U+00F1), which is also the ISO-8859-1 encoding of ñ (since the first 256 Unicode codepoints match ISO-8859-1).
If you take these five bytes and treat them as being ISO-8859-1, then they correspond to the string
niÃ±o
Looking at the ISO-8859-1 codepage, 0xc3 maps to Ã, and 0xb1 maps to ±.
So what's happening on your local machine is that your app is receiving the five bytes 0x6e 0x69 0xc3 0xb1 0x6f from Redis, which is the UTF-8 representation of "niño". On Heroku it's receiving the four bytes 0x6e 0x69 0xf1 0x6f, which is the ISO-8859-1 representation.
The real fix to your problem will be to make sure the strings being put into Redis are all already UTF-8 (or at least all the same encoding). I haven't used Redis, but from what I can tell from a brief Google, it doesn't concern itself with string encodings but simply gives back whatever bytes it's been given. You should look at whatever process is putting the data into Redis, and ensure that it handles the encoding properly.
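As a short sketch of the two situations described above (the ISO-8859-1 string is built by hand here, standing in for what the Heroku Redis returns):
utf8_msg   = "niño"                        # what the local Redis returns
latin1_msg = "niño".encode('ISO-8859-1')   # what the Heroku Redis returns

utf8_msg.bytes.map   { |b| format('0x%02x', b) }  # => ["0x6e", "0x69", "0xc3", "0xb1", "0x6f"]
latin1_msg.bytes.map { |b| format('0x%02x', b) }  # => ["0x6e", "0x69", "0xf1", "0x6f"]

latin1_msg.encode('UTF-8')                              # => "niño" (transcoding makes it usable)
latin1_msg.dup.force_encoding('UTF-8').valid_encoding?  # => false (re-tagging alone does not)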
