Rails, Heroku and "invalid byte sequence in UTF-8" error

I have a queue of text messages in Redis. Let's say a message in redis is something like this:
"niño"
(spot the non-standard character).
The Rails app displays the queue of messages. When I test locally (Rails 3.2.2, Ruby 1.9.3) everything is fine, but on Heroku Cedar (Rails 3.2.2 and, I believe, Ruby 1.9.2) I get the infamous error: ActionView::Template::Error (invalid byte sequence in UTF-8)
After reading and rereading all I could find online I am still stuck as to how to fix this.
Any help or point to the right direction is greatly appreciated!
edit:
I managed to find a solution. I ended up using Iconv:
string = Iconv.iconv('UTF-8', 'ISO-8859-1', message)[0]
None of the suggested answers I found elsewhere seemed to work in my case.

On Heroku, when your app receives the message "niño" from Redis, it is actually getting the four bytes:
0x6e 0x69 0xf1 0x6f
which, when interpreted as ISO-8859-1, correspond to the characters n, i, ñ and o.
However, your Rails app assumes that these bytes should be interpreted as UTF-8, and at some point it tries to decode them this way. The third byte in this sequence, 0xf1, looks like this in binary:
1 1 1 1 0 0 0 1
If you compare this to the table on the Wikipedia UTF-8 page, you can see this byte is the leading byte of a four-byte character (it matches the pattern 11110xxx), and as such should be followed by three more continuation bytes that all match the pattern 10xxxxxx. It isn't: the next byte is 0x6f (01101111), so this is an invalid UTF-8 byte sequence and you get the error you see.
Using:
string = message.encode('utf-8', 'iso-8859-1')
(or the Iconv equivalent) tells Ruby to read message as ISO-8859-1 encoded, and then to create the equivalent string in UTF-8 encoding, which you can then use without problems. (An alternative could be to use force_encoding to tell Ruby the correct encoding of the string, but that will likely cause problems later when you try to mix UTF-8 and ISO-8859-1 strings).
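To make the difference concrete, here is a minimal sketch of both approaches using the raw bytes from this answer (nothing Rails- or Redis-specific is assumed):
message = "\x6e\x69\xf1\x6f".b   # the four ISO-8859-1 bytes for "niño", tagged as binary
message.dup.force_encoding('UTF-8').valid_encoding?   # => false, only the tag changed
message.encode('UTF-8', 'ISO-8859-1')                 # => "niño", actually transcoded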
In UTF-8, the string "niño" corresponds to the bytes:
0x6e 0x69 0xc3 0xb1 0x6f
Note that the first, second and last bytes are the same. The ñ character is encoded as the two bytes 0xc3 0xb1. If you write these out in binary and compare them to the Wikipedia table again, you'll see they encode the codepoint 0xf1, which is also the ISO-8859-1 encoding of ñ (since the first 256 Unicode codepoints match ISO-8859-1).
If you take these five bytes and treat them as ISO-8859-1, they correspond to the string
niÃ±o
since, looking at the ISO-8859-1 code page, 0xc3 maps to Ã and 0xb1 maps to ±.
So what's happening on your local machine is that your app is receiving the five bytes 0x6e 0x69 0xc3 0xb1 0x6f from Redis, which is the UTF-8 representation of "niño". On Heroku it's receiving the four bytes 0x6e 0x69 0xf1 0x6f, which is the ISO-8859-1 representation.
The real fix to your problem will be to make sure the strings being put into Redis are all already UTF-8 (or at least all the same encoding). I haven't used Redis, but from what I can tell from a brief Google, it doesn't concern itself with string encodings but simply gives back whatever bytes it's been given. You should look at whatever process is putting the data into Redis, and ensure that it handles the encoding properly.
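If the process writing to Redis is also Ruby, a sketch of that kind of boundary check might look like this (the rpush call is from the standard redis-rb gem; the 'messages' key and the ISO-8859-1 fallback are just assumptions for illustration):
require 'redis'
redis = Redis.new
def push_message(redis, text)
  # Normalize at the write boundary: treat anything that isn't valid UTF-8 as ISO-8859-1.
  utf8 = text.dup.force_encoding('UTF-8')
  utf8 = text.encode('UTF-8', 'ISO-8859-1') unless utf8.valid_encoding?
  redis.rpush('messages', utf8)
end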


Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API response should be "manhã", but the API returns "manh�". Other chars like á or ç are returned correctly, so I guess this is a bug in the API.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You're on the right track: you should be able to solve the problem with the encode method. When the source encoding is known, you can simply use:
response.encode('UTF-8', 'ISO-8859-1')
There may be times when there are invalid characters in the source encoding, and to get around the resulting exceptions you can instruct Ruby how to handle them:
# This will transcode the string to UTF-8 and replace any invalid/undefined characters with '' (an empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
This is all laid out in the Ruby docs for String - check them out!
---
Note, many people incorrectly assume that force_encoding will somehow fix encoding problems. force_encoding simply tags the string as the specified encoding; it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comments, you can combine force_encoding with encode to transcode your string: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first encode example above).
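To make this concrete, here is a small sketch reproducing the broken bytes from the question (0xC3 followed by 0x2F); the scrub call at the end is an alternative to the replace options and has been available since Ruby 2.1:
response = "manh\xC3/".b                          # tagged ASCII-8BIT, as in the error message
begin
  response.encode('UTF-8')                        # no source encoding given
rescue Encoding::UndefinedConversionError => e
  puts e.message                                  # "\xC3" from ASCII-8BIT to UTF-8
end
response.encode('UTF-8', 'ISO-8859-1')            # => "manhÃ/" (every byte maps to something)
response.dup.force_encoding('UTF-8').scrub('')    # => "manh/" (drops the stray byte instead)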

RÓÍSÍN is being rendered as RÃôÃìSÃìN - which encoding is this?

I have a set of characters in a UTF-8 file like so:
RÓÍSÍN
HÉÁTHÉR
The file is being sent to another system, but the characters are being rendered like this:
RÃôÃìSÃìN ÃüNDREW
H̟̑TH̑R MULL̟N
Is it possible to tell from this information which character encoding the characters are being rendered as on the remote system?
I don't think you can tell exactly which encoding is being used, but you can tell it is an encoding that uses 1 byte per character (UTF-8 uses 1 to 4).
In UTF-8, 'Ó' is the two bytes 0xC3 0x93, which is 195 147 in decimal. A single-byte "ANSI" encoding renders each of those bytes as its own character, yielding something like 'Ãô', which matches your output.
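If you want to experiment, a rough way to narrow it down is to take the UTF-8 bytes and reinterpret them in a few candidate single-byte encodings; this Ruby sketch (the candidate list is just a guess) prints how 'Ó' would come out under each:
data = "Ó".b                                       # the two UTF-8 bytes 0xC3 0x93
%w[Windows-1252 ISO-8859-1 IBM437 macRoman].each do |name|
  shown = data.dup.force_encoding(name).encode('UTF-8', invalid: :replace, undef: :replace)
  puts format('%-12s %s', name, shown)             # how 'Ó' would render on that system
end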

string.sub in Corona Lua crashes with ÅÄÖ

This snippet crashes my simulator badly:
s = "stämma"
s1 = string.sub(s,3,3)
print(s1)
It seems like it treats my character as nil. Any ideas?
Joakim
I assume you are using UTF-8 encoding.
In UTF-8, a character can take a variable number of bytes, between 1 and 4. The "ä" character (codepoint 228, U+00E4) is encoded as the two bytes 0xC3 0xA4.
The call string.sub(s, 3, 3) returns the third byte of the string (0xC3), not the third character. Since this byte on its own is not valid UTF-8, Corona can't display the character.
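The same byte-versus-character distinction can be seen in Ruby (used elsewhere on this page); this is only a cross-language illustration of what the Lua call is doing:
s = "stämma"
s.length      # => 6 characters
s.bytesize    # => 7 bytes ("ä" takes two: 0xC3 0xA4)
s.bytes[2]    # => 195 (0xC3), the byte that string.sub(s, 3, 3) extracts
s[2]          # => "ä", the third character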
See also Extract the first letter of a UTF-8 string with Lua

tackle different types of utf hyphens in ruby 1.8.7

We have different types of hyphens/dashes (in some text) populated in the db. Before comparing them with user input text, I have to normalize every type of dash/hyphen to a simple hyphen-minus (ASCII 45).
The possible dashes we have to convert are:
Minus (−) U+2212
Hyphen-minus (-) U+002D
Hyphen (‐) U+2010
Soft hyphen U+00AD
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012 (8210)
En dash (–) U+2013 (8211)
Em dash (—) U+2014 (8212)
Horizontal bar (―) U+2015 (8213)
These all have to be converted to Hyphen-minus(-) using gsub.
I've used the CharDet gem to detect the character encoding of the fetched string; it reports windows-1252. I've tried Iconv to convert the encoding to ASCII, but it throws an Iconv::IllegalSequence exception.
ruby -v => ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
rails -v => Rails 2.3.5
mysql encoding => 'latin1'
Any idea how to accomplish this?
Caveat: I know nothing about Ruby, but you have problems that have nothing to do with the programming language you are using.
You don't need to convert Hyphen-minus (-) U+002D to a simple hyphen/minus (ASCII 45); they're the same thing.
You believe that the database encoding is latin1. The statement "My data is encoded in ISO-8859-1 aka latin1" is up there with "The check is in the mail" and "Of course I'll still love you in the morning". All it tells you is that it is a single-byte-per-character encoding.
Presuming that "fetched string" means "byte string extracted from the database", chardet is very likely quite right in reporting windows-1252 aka cp1252 -- however this may be by accident as chardet sometimes seems to report that as a default when it has exhausted other possibilities.
(a) These Unicode characters cannot be encoded in latin1, cp1252 or ascii:
Minus (−) U+2212
Hyphen (‐) U+2010
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012 (8210)
Horizontal bar (―) U+2015 (8213)
What gives you the impression that they may possibly appear in the input or in the database?
(b) These Unicode characters can be encoded in cp1252 but not latin1 or ascii:
En dash (–) U+2013 (8211)
Em dash (—) U+2014 (8212)
These (most likely the EN DASH) are what you really need to convert to an ascii hyphen/dash. What was in the string that chardet reported as windows-1252?
(c) This one can be encoded in cp1252 and latin1 but not ascii:
Soft hyphen U+00AD
If a string contains non-ASCII characters, any attempt (using iconv or any other method) to convert it to ascii will fail, unless you use some kind of "ignore" or "replace with ?" option. Why are you trying to do that?
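For the normalization itself, once the data is known to be valid UTF-8, a minimal gsub sketch for Ruby 1.9+ could look like the lines below; on 1.8.7 you would have to match the raw UTF-8 byte sequences instead, since strings there are byte-oriented:
DASHES = /[\u00AD\u2010\u2011\u2012\u2013\u2014\u2015\u2212]/  # every dash-like codepoint from the question
def normalize_dashes(text)
  text.gsub(DASHES, '-')   # replace each variant with ASCII hyphen-minus
end
normalize_dashes("3\u20134 \u2212 5\u20146")   # => "3-4 - 5-6"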

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the numbers.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode: the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E2 80 93 in UTF-8 encoding.
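For comparison, the same three bytes are easy to confirm in any language with UTF-8 strings; in Ruby (used elsewhere on this page, and shown purely as an illustration):
"\u2013".bytes.map { |b| format('0x%02X', b) }   # => ["0xE2", "0x80", "0x93"]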
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
