Why does Rails 3 think \xE2\x80\x99 means â \x80 \x99 - ruby-on-rails

I have a field scraped from a utf-8 page:
"O’Reilly"
And saved in a yml file:
:name: "O\xE2\x80\x99Reilly"
(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe.)
However when I load the value into a hash and yield it to a page tagged as utf-8, I get:
OâReilly
I looked up the character â, which is encoded in UTF-16 as \x00E2, and the characters \x80 and \x99 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8 character.
How do I make rails interpret a 3-byte UTF-8 code as a single character?

Ruby strings are sequences of bytes instead of characters:
$ irb
>> "O\xE2\x80\x99Reilly"
=> "O\342\200\231Reilly"
Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to see that you output the correct string in HTML (I assume you want HTML since you mentioned Rails) is to convert non-printable characters to HTML entities; in your case to
O&#8217;Reilly
This takes some work, but it should help in cases where you send your HTML as UTF-8 and your end user has set the browser to override that and show Latin-1 or some other restricted charset.
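The entity conversion described above can be sketched as follows. This is a minimal illustration, not a Rails API: `to_html_entities` is a hypothetical helper name, and it only handles non-ASCII characters (it does not escape `&`, `<` or `>`).

```ruby
# Replace every non-ASCII character with a decimal HTML entity.
# `to_html_entities` is a hypothetical helper, not part of Ruby or Rails.
def to_html_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

# U+2019 (right single quotation mark) is 8217 in decimal.
puts to_html_entities("O\u2019Reilly")
```

Because the output is pure ASCII, it renders the same whichever charset the browser assumes.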

Ultimately this was caused by loading a syck file (generated by an external script) with psych (in rails). Loading with syck solved the issue:
# in the plain Ruby environment
puts YAML::ENGINE.yamler  # => syck
# in Rails
puts YAML::ENGINE.yamler  # => psych
# in the web app: switch to syck for this file, then switch back
YAML::ENGINE.yamler = 'syck'
a = YAML::load(file_saved_with_syck)
a[index][:name]  # => "O’Reilly"
YAML::ENGINE.yamler = 'psych'
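A likely reason for the difference: in the YAML specification, `\xNN` inside a double-quoted scalar denotes the 8-bit code point U+00NN, not a raw byte, and Psych follows that reading. The sketch below assumes a current Ruby where `YAML` is Psych (`YAML::ENGINE` no longer exists there), and uses a plain string key instead of `:name` to keep the example independent of symbol handling.

```ruby
require "yaml"

# The YAML text exactly as it sits in the file: literal backslash escapes.
YAML_SOURCE = 'name: "O\xE2\x80\x99Reilly"'

# Load the scalar with Psych and return the resulting Ruby string.
def loaded_name
  YAML.load(YAML_SOURCE)["name"]
end

# Each \xNN escape becomes its own character (U+00E2, U+0080, U+0099)
# rather than one three-byte UTF-8 sequence.
p loaded_name.codepoints.map { |cp| format("U+%04X", cp) }
```

Under this reading the loaded string has ten characters, which matches the "â" plus two invisible characters seen in the question.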

I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.
It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.
The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2 and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings with the associated ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later.
If this is the case you might have to force_encoding the read strings back to what they should have been, or set default_internal to cause the strings to be read back into UTF-8. Bit of a mess this.
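A small sketch of the difference between mangling and relabelling, assuming the raw bytes arrive tagged as ASCII-8BIT. The method names are hypothetical; `force_encoding` and `encode` are the standard String methods.

```ruby
# The ten raw UTF-8 bytes of "O’Reilly", labelled as binary (ASCII-8BIT).
RAW_BYTES = "O\xE2\x80\x99Reilly".dup.force_encoding(Encoding::BINARY)

# Wrong: treat each byte as an ISO-8859-1 character, then transcode to UTF-8.
def mangle(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Right: the bytes are already UTF-8, so only the label needs fixing.
def relabel_as_utf8(bytes)
  bytes.dup.force_encoding("UTF-8")
end

# Undo the mangling after the fact: back to single bytes, then relabel.
def unmangle(text)
  text.encode("ISO-8859-1").force_encoding("UTF-8")
end

puts mangle(RAW_BYTES).length          # 10 characters
puts relabel_as_utf8(RAW_BYTES).length # 8 characters
```

`unmangle` only works while every character of the mangled string is still below U+0100, which holds for text that went through exactly one ISO-8859-1 misreading.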

Related

Handling UTF-8 Character with Latin1 db encoding

I keep getting an exception that ActiveRecord::StatementInvalid: PG::UntranslatableCharacter: ERROR: character with byte sequence 0xe2 0x80 0x99 in encoding "UTF8" has no equivalent in encoding "LATIN1". I did some checking and it looks like it is the backtick or apostrophe. What is the best way to handle this? Just strip out the character or convert the whole db to UTF-8? If it is converting to UTF-8 how can I do that permanently as it always seems to revert if you do it in the shell?
I don't understand what you mean by "revert if you do it in the shell", but you seem to have an application where some parts (at least the database) use the LATIN1 encoding and one part (your Rails app) uses UTF-8. IMO, it is best to have everything in Unicode, but to what extent a conversion makes sense cannot be said in general. For example, if your database is also processed by other tools that expect Latin-1, a conversion is not sensible.
In any case, you need to define a clear boundary between where you use which encoding, and handle conversion at that boundary. This applies not only to the database but also, for example, to the HTML pages you generate (hopefully UTF-8), to files uploaded by users and processed by your application, and so on.
If you convert to an encoding in which certain characters cannot be represented, as in your case, you have only three choices:
Reject the data (it must have been generated somewhere, perhaps as user input in a web form)
Remove the offending characters
Replace the offending characters with a placeholder (for instance, a question mark)
None of these options is very pleasant, but if converting your database to UTF-8 is no option, you should deal with this problem at the point where the problem string is generated, and not when it is written into the database.
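The three choices map onto options of String#encode. A minimal sketch, assuming Ruby's ISO-8859-1 stands in for PostgreSQL's LATIN1; the helper names are hypothetical.

```ruby
LATIN1 = "ISO-8859-1"

# 1. Reject: String#encode raises Encoding::UndefinedConversionError
#    when a character has no equivalent in the target encoding.
def latin1_or_reject(text)
  text.encode(LATIN1)
end

# 2. Remove the offending characters.
def latin1_stripped(text)
  text.encode(LATIN1, undef: :replace, replace: "")
end

# 3. Replace the offending characters with a placeholder.
def latin1_with_placeholder(text)
  text.encode(LATIN1, undef: :replace, replace: "?")
end

puts latin1_stripped("O\u2019Reilly")
puts latin1_with_placeholder("O\u2019Reilly")
```

Calling one of these where user input enters the application, rather than just before the INSERT, keeps the decision in one place.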

Rails oracle raw16

I'm using Rails 3.2.1 and I have been stuck on a problem for quite a long time.
I'm using the Oracle enhanced adapter and I have a raw(16) (uuid) column. When I try to display the data, one of two things happens:
1) I see weird symbols
2) I get "incompatible character encoding: ASCII-8BIT and UTF-8".
In my application.rb file I added the
config.encoding = 'utf-8'
and in my view file I added
'#encoding=utf-8'
But so far nothing worked
I also tried adding html_safe, but it failed.
How can I safely display my uuid data?
Thank you very much
Answer:
I used the unpack method to convert the binary with the parameters H8H4H4H4H12 and, in the end, joined the array :-)
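That approach might look like the sketch below. `format_raw16` is a hypothetical helper, the sample bytes are made up, and joining with "-" is an assumption (the answer only says the array was joined).

```ruby
# Format 16 raw bytes as the usual 8-4-4-4-12 groups of hex digits.
# `format_raw16` is a hypothetical helper; the "-" separator is an assumption.
def format_raw16(bytes)
  bytes.unpack("H8H4H4H4H12").join("-")
end

# Made-up sample value standing in for a RAW(16) column.
raw = [0x12, 0x3e, 0x45, 0x67, 0xe8, 0x9b, 0x12, 0xd3,
       0xa4, 0x56, 0x42, 0x66, 0x14, 0x17, 0x40, 0x00].pack("C*")

puts format_raw16(raw)
```

The result is plain ASCII, so it can be mixed with UTF-8 view output without an encoding error.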
The RAW datatype is a string of bytes that can take any value. This includes binary data that doesn't translate to anything meaningful in ASCII or UTF-8 or in any character set.
You should really read Joel Spolsky's note about character sets and unicode before continuing.
Now, since the data can't be translated reliably to a string, how can we display it? Usually we convert or encode it, for instance:
We could use the hexadecimal representation, where each byte is converted to two [0-9A-F] characters (in Oracle, using the RAWTOHEX function). This is fine for displaying small binary fields such as RAW(16).
You can also use other encodings such as Base64 (in Oracle, with the UTL_ENCODE package).
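On the Ruby side, both representations can be produced with pack and unpack alone. A sketch with hypothetical helper names and made-up sample bytes:

```ruby
# Hexadecimal: two [0-9A-F] characters per byte.
def to_hex(bytes)
  bytes.unpack("H*").first.upcase
end

# Base64 without line breaks ("m0" is the strict Base64 pack directive).
def to_base64(bytes)
  [bytes].pack("m0")
end

sample = [0x00, 0xFF, 0x10].pack("C*")
puts to_hex(sample)
puts to_base64(sample)
```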

Ruby Strings, get list of compatible encodings

How can I get a list of compatible encodings for a Ruby String? (MRI 1.9.3)
Use case: I have some user-provided strings, encoded in UTF-8. Ideally I need to convert them to ISO/IEC 8859-1 (8-bit), but I also need to fall back to Unicode when some special characters are present.
Also, is there a better way to accomplish this? Maybe I am testing the wrong thing.
EDIT- adding more details
Thanks for the answers, I should probably add some context.
I know how to perform encoding conversion.
I'm looking for a way to quickly find out if a string can be safely encoded to another encoding or, to put it in another (and quite wrong) way, what is the minimum encoding to support all the characters in that string.
Just converting the strings to a 16-bit encoding is not an option, because they will be sent as SMSs, and converting them to a 16-bit encoding cuts the number of available characters from 160 down to 70.
I need to convert them to 16 bits only when they contain a special character that is not supported in ISO/IEC 8859-1.
Unfortunately, Ruby’s ideas of encoding compatibility are not fully congruent with your use case. However, trying to encode your UTF-8 string in ISO-8859-1 and catching the error that is raised when a conversion is not possible will achieve what you are after:
begin
  'your UTF-8 string'.encode!('ISO-8859-1')
rescue Encoding::UndefinedConversionError
  # not representable in ISO-8859-1; the string stays UTF-8
end
will convert your string to ISO-8859-1 if possible and leave it as UTF-8 if not.
Note this uses encode, which actually transcodes the string using Encoding::Converter (i.e. reassigns the correct encoding byte pattern to the character representations of the string), unlike force_encoding, which just changes the encoding flag (i.e. tells Ruby to interpret the string’s byte stream according to the set encoding).
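Applied to the SMS use case, the same idea can be wrapped in a method that returns a new string instead of mutating in place. This is a sketch: `encode_for_sms` is a hypothetical name, and UTF-16BE is an assumption for the "16-bit" fallback the question mentions.

```ruby
# Try the compact 8-bit encoding first; fall back to a 16-bit one
# only when some character has no ISO-8859-1 equivalent.
# UTF-16BE as the fallback is an assumption, not taken from the question.
def encode_for_sms(text)
  text.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError
  text.encode("UTF-16BE")
end

puts encode_for_sms("caf\u00e9").encoding
puts encode_for_sms("\u65e5\u672c").encoding
```

The caller can then inspect `encoding` on the result to decide between the 160- and 70-character limits.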
Ruby's core library includes the Encoding class and its nested class Encoding::Converter; they are probably your best friends in this case.
#!/usr/bin/env ruby
# encoding: utf-8
converter = Encoding::Converter.new("UTF-8", "ISO-8859-1")
converted = converter.convert("é")
puts converted.encoding
# => ISO-8859-1
puts converted.dump
# => "\xE9"
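Is valid_encoding? (an instance method of String) useful? That is:
try_str = str.dup.force_encoding("ISO-8859-1")
str = try_str if try_str.valid_encoding?
Note that every byte sequence is valid ISO-8859-1, so this check always passes; it relabels the bytes rather than testing whether the characters can be converted.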
Note that force_encoding only relabels the bytes. Relabelling a UTF-8 string as ISO-8859-1 and then encoding it to UTF-8 double-encodes it:
1.9.3p194 :002 > puts "é".force_encoding("ISO-8859-1").encode("UTF-8")
Ã©
=> nil
Linked Answer
"Some String".force_encoding("ISO-8859-1")
You can also refer to the Rails encoding link.

Detecting non-ASCII characters in Rails

I am wondering if there's a way to detect non-ASCII characters in Rails.
I have read that Rails does not use Unicode by default, and characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these characters in Rails? or just specify the range of characters I am expecting?
Is there a plugin for this? Thanks in advance!
All encodings for ideographic languages use multiple bytes to represent a character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).
You may compare the character length to the byte length of the string as a quick and dirty detector. It is probably not foolproof though.
class String
  def multibyte?
    chars.count < bytes.count
  end
end
"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false
This is pretty easy with 1.9.2 as regular expressions are character-based in 1.9.2 and 1.9.2 knows the difference between bytes and characters top to bottom. You're in Rails so you should get everything in UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range so you can just remove everything that isn't between ' ' and '~' when you have UTF-8 encoded text:
>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"
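If the goal is to detect rather than strip, Ruby 1.9+ also offers String#ascii_only? and Unicode script properties in regular expressions. A sketch with hypothetical method names; note that `\p{Han}` covers Chinese characters and kanji only, so Japanese kana would need `\p{Hiragana}` and `\p{Katakana}` as well.

```ruby
# True when the string contains at least one non-ASCII character.
def non_ascii?(text)
  !text.ascii_only?
end

# True when the string contains at least one Han (CJK ideograph) character.
def contains_han?(text)
  !!(text =~ /\p{Han}/)
end

puts non_ascii?("qwerty")
puts non_ascii?("caf\u00e9")
puts contains_han?("\u53ef\u53e3\u53ef\u6a02")
```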
There's really no reason to go to all this trouble, though. Ruby 1.9 works great with Unicode, as does Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago; now it is common and fairly straightforward.
If you do manage to get text data that isn't UTF-8 then you have some options. If the encoding is ASCII-8BIT or BINARY then you can probably get away with s.force_encoding('utf-8'). If you end up with something other than UTF-8 and ASCII-8BIT then you can use Iconv to re-encode it.
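Iconv was removed from the standard library in Ruby 2.0; on current Rubies the same re-encoding can be done with String#encode, which accepts the source encoding as a second argument. A sketch, with a hypothetical method name and ISO-8859-1 assumed as the source encoding of the sample:

```ruby
# Transcode text from a known source encoding to UTF-8,
# regardless of how the string is currently labelled.
def transcode_to_utf8(text, source_encoding)
  text.encode("UTF-8", source_encoding)
end

# "café" as ISO-8859-1 bytes, arriving labelled as binary.
latin1_bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")
puts transcode_to_utf8(latin1_bytes, "ISO-8859-1")
```

This differs from force_encoding, which is appropriate only when the bytes are already UTF-8 and merely carry the wrong label.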
References:
Encoding
Iconv
String#force_encoding

How to use regex for utf8 in ruby

In RoR, how do I validate a Chinese or Japanese word in a posting form using UTF-8?
With GBK, [\u4e00-\u9fa5]+ is used to validate Chinese words.
In PHP, /^[\x{4e00}-\x{9fa5}]+$/u is used for UTF-8 pages.
Ruby 1.8 has poor support for UTF-8 strings. You need to write the bytes individually in the regular expression, rather than the full code point:
>> "acentuação".scan(/\xC3\xA7/)
=> ["ç"]
To match the range you specified the expression will become a bit complicated:
/([\x4E-\x9E][\x00-\xFF])|(\x9F[\x00-\xA5])/ # (untested)
That will be improved in Ruby 1.9, though.
Edit: As noted in the comments, the Unicode characters \u4E00-\u9FA5 only map to the expression above in the UTF-16BE encoding. The UTF-8 encoding is different, so you need to analyze the mapping carefully and see if you can come up with a byte-matching expression for Ruby 1.8.
This is what I have done:
%r{^[#{"\344\270\200"}-#{"\351\277\277"}]+$}
This is basically a regular expression with the octal values that represent the range between U+4E00 and U+9FFF, the most common Chinese and Japanese characters.
The Oniguruma regexp engine has proper support for Unicode. Ruby 1.9 uses Oniguruma by default. Ruby 1.8 can be recompiled to use it.
With Oniguruma you can use the exact same regex as in PHP, including the /u modifier to force Ruby to treat the string as UTF-8.
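In Ruby 1.9 and later, the range from the question can be written with Ruby's own \u escapes. A minimal sketch; `cjk_only?` is a hypothetical name, and the range is simply the one quoted in the question.

```ruby
# One or more characters in U+4E00..U+9FA5 and nothing else.
# \A and \z anchor the whole string; ^ and $ would only anchor a line.
CJK_ONLY = /\A[\u4e00-\u9fa5]+\z/

def cjk_only?(text)
  !!(text =~ CJK_ONLY)
end

puts cjk_only?("\u53ef\u53e3\u53ef\u6a02")
puts cjk_only?("abc")
```

In a Rails model the same pattern could be passed to a format validation, though that wiring is not shown here.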
ActiveSupport has a UTF-8 handler:
http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Handlers/UTF8Handler.html
Otherwise, in Ruby 1.9, look at the encoding method of Regexp objects.

Resources