I am wondering if there's a way to detect non-ASCII characters in Rails.
I have read that Rails does not use Unicode by default, and characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these characters in Rails, or to specify the range of characters I am expecting?
Is there a plugin for this? Thanks in advance!
All ideographic language encodings use multiple bytes to represent a character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).
You can compare the character length to the byte length of the string as a quick-and-dirty detector; it is probably not foolproof, though.
class String
  def multibyte?
    chars.count < bytes.count
  end
end
"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false
This is pretty easy with 1.9.2, as regular expressions are character-based there and Ruby knows the difference between bytes and characters top to bottom. You're in Rails, so you should get everything in UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range, so you can just remove everything that isn't between ' ' and '~' when you have UTF-8 encoded text:
>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"
There's really no reason to go to all this trouble, though. Ruby 1.9 works great with Unicode, as do Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago; now it is common and fairly straightforward.
If you do manage to get text data that isn't UTF-8 then you have some options. If the encoding is ASCII-8BIT or BINARY then you can probably get away with s.force_encoding('utf-8'). If you end up with something other than UTF-8 and ASCII-8BIT then you can use Iconv to re-encode it.
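For example, a minimal sketch of those two fallbacks (incoming_text is a stand-in for whatever string you received):
s = incoming_text
if s.encoding == Encoding::ASCII_8BIT
  s.force_encoding('utf-8')                    # just relabel the raw bytes as UTF-8
else
  require 'iconv'
  s = Iconv.conv('utf-8', s.encoding.name, s)  # transcode from the declared encoding
end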
References:
Encoding
Iconv
String#force_encoding
How can I get a list of compatible encodings for a Ruby String? (MRI 1.9.3)
Use case: I have some user-provided strings, encoded with UTF-8. Ideally I need to convert them to ISO/IEC 8859-1 (8-bit), but I also need to fall back to Unicode when some special characters are present.
Also, is there a better way to accomplish this? Maybe I am testing the wrong thing.
EDIT: adding more details
Thanks for the answers, I should probably add some context.
I know how to perform encoding conversion.
I'm looking for a way to quickly find out if a string can be safely encoded to another encoding or, to put it in another (and quite wrong) way, what is the minimum encoding to support all the characters in that string.
Just converting the strings to a 16-bit encoding is not an option, because they will be sent as SMS messages, and converting them to a 16-bit encoding (UCS-2) cuts the number of available characters from 160 down to 70.
I need to convert them to 16-bit only when they contain a special character which is not supported in ISO/IEC 8859-1.
Unfortunately, Ruby's idea of encoding compatibility is not fully congruent with your use case. However, trying to encode your UTF-8 string as ISO-8859-1 and catching the error that is thrown when a conversion is not possible will achieve what you are after:
begin
  'your UTF-8 string'.encode!('ISO-8859-1')
rescue Encoding::UndefinedConversionError
end
will convert your string to ISO-8859-1 if possible and leave it as UTF-8 if not.
Note this uses encode, which actually transcodes the string using Encoding::Converter (i.e. reassigns the correct encoding byte pattern to the character representations of the string), unlike force_encoding, which just changes the encoding flag (i.e. tells Ruby to interpret the string’s byte stream according to the set encoding).
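A minimal sketch of the difference (assuming a UTF-8 source file):
s = "é"                       # UTF-8, bytes C3 A9
s.encode('ISO-8859-1').unpack('C*')
# => [233]   transcoded: a new byte pattern for the same character
s.dup.force_encoding('ISO-8859-1').unpack('C*')
# => [195, 169]   same bytes, merely relabeled (and now misread as "Ã©")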
Ruby's standard library includes the Encoding class and its subclass Encoding::Converter; they are probably your best friends in this case.
#!/usr/bin/env ruby
# encoding: utf-8
converter = Encoding::Converter.new("UTF-8", "ISO-8859-1")
converted = converter.convert("é")
puts converted.encoding
# => ISO-8859-1
puts converted.dump
# => "\xE9"
Is valid_encoding? (an instance method of String) useful? That is:
try_str = str.force_encoding("ISO-8859-1")
str = try_str if try_str.valid_encoding?
To convert to ISO-8859-1, you can encode it as follows:
1.9.3p194 :002 > puts "é".force_encoding("ISO-8859-1").encode("UTF-8")
é
=> nil
From a linked answer:
"Some String".force_encoding("ISO-8859-1")
You can also refer to the Rails encoding documentation.
I need to write a CSV export program that internally uses UTF-8 encoding, originating from user input via the web (so you can expect any characters). It's a Japanese system, so I need to encode the output to Shift_JIS.
Now, when I convert UTF-8 into Shift_JIS, I get errors like:
Encoding::UndefinedConversionError (U+7E6B from UTF-8 to Shift_JIS):
I want to either a) eliminate the character, or b) map the character to some other character (or simply to the string '(U+7E6B)').
It seems I could catch the exception and eliminate the character as a byte string, but there must be an easier way to do this.
What is the best way to do this conversion?
[Converting my follow-up comments on the question into an answer]
I found that encode takes options, and calling encode with
:undef => :replace, :replace => "?"  # for UndefinedConversionError
has the desired effect. You can also specify the following:
:invalid => :replace  # for InvalidByteSequenceError
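A minimal sketch of both variants (the '?' replacement string and the fallback lambda are illustrative choices, not part of the original answer):
# a) replace anything Shift_JIS cannot represent with '?'
utf8_string.encode('Shift_JIS', :undef => :replace, :invalid => :replace, :replace => '?')
# b) map each unconvertible character to a marker such as '(U+7E6B)'
utf8_string.encode('Shift_JIS', :fallback => lambda { |char| '(U+%04X)' % char.ord })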
I have a field scraped from a utf-8 page:
"O’Reilly"
And saved in a yml file:
:name: "O\xE2\x80\x99Reilly"
(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe)
However when I load the value into a hash and yield it to a page tagged as utf-8, I get:
OâReilly
I looked up the character â, which is encoded in UTF-16 as 0x00E2, and the characters 0x80 and 0x99 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.
How do I make Rails interpret a 3-byte UTF-8 code as a single character?
Ruby (1.8) strings are sequences of bytes rather than characters:
$ irb
>> "O\xE2\x80\x99Reilly"
=> "O\342\200\231Reilly"
Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to ensure that you output the correct string in HTML (I assume you want HTML since you mentioned Rails) is to convert non-printable characters to HTML entities; in your case to
O&#8217;Reilly
This takes some work, but it should help in cases where you send your HTML in UTF-8 but your end user has set his or her browser to override and show Latin-1 or some other silly restricted charset.
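A sketch of that conversion (entity_escape is a hypothetical helper, not a Rails built-in):
def entity_escape(str)
  # unpack('U*') yields Unicode code points; escape everything non-ASCII
  str.unpack('U*').map { |c| c < 128 ? c.chr : "&##{c};" }.join
end
entity_escape("O\xE2\x80\x99Reilly")  # => "O&#8217;Reilly"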
Ultimately this was caused by loading a syck file (generated by an external script) with psych (in Rails). Loading with syck solved the issue:
# in ruby environment
puts YAML::ENGINE.yamler  # => syck

# in rails
puts YAML::ENGINE.yamler  # => psych

# in webapp
YAML::ENGINE.yamler = 'syck'
a = YAML::load(file_saved_with_syck)
a[index][:name]  # => "O’Reilly"
YAML::ENGINE.yamler = 'psych'
I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.
It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.
The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2 and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings with an associated ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later.
If this is the case, you might have to force_encoding the read strings back to what they should have been, or set Encoding.default_internal to cause the strings to be read in as UTF-8. Bit of a mess, this.
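A sketch of the two repairs (str stands for one of the strings read from the YAML; which repair applies depends on where the mangling happens):
str.force_encoding('UTF-8')          # relabel bytes that were read as ASCII-8BIT
# or, before the YAML file is loaded:
Encoding.default_internal = 'UTF-8'  # have Ruby transcode input to UTF-8 as it is read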
I am quite new to Ruby and Ruby on Rails.
While following a Ruby guide to create a small food-finding application, the author used the number_to_currency method from Ruby on Rails. The trouble is that the default unit is $, but I would like to change it to £.
When I did this, it gave me back the following error when I tried to run the code.
number_helper.rb:7 invalid multibyte char (US-ASCII) (SyntaxError)
Put the following on the first line of the file where you have £.
#coding: utf-8
By default, Ruby reads source files as one-byte US-ASCII characters. The £ character does not fit within US-ASCII, and the magic comment above lets Ruby read the file as UTF-8, which is becoming the standard and is capable of handling multi-byte characters, including £ (added following the suggestion by the Tin Man).
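For example, with the magic comment in place, a helper file can contain £ literally (price_in_pounds is just an illustrative name):
# coding: utf-8
module PricesHelper
  def price_in_pounds(amount)
    number_to_currency(amount, :unit => "£")  # £ is now legal in this source file
  end
end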
Edit
With Ruby 2.0, to be released this month, the default source encoding will be UTF-8, so you will not need to do this anymore.
In RoR, how do you validate a Chinese or Japanese word for a posting form with UTF-8 code?
In GBK code, it uses [\u4e00-\u9fa5]+ to validate Chinese words.
In Php, it uses /^[\x{4e00}-\x{9fa5}]+$/u for utf-8 pages.
Ruby 1.8 has poor support for UTF-8 strings. You need to write the bytes individually in the regular expression, rather than the full code:
>> "acentuação".scan(/\xC3\xA7/)
=> ["ç"]
To match the range you specified, the expression becomes a bit complicated:
/([\x4E-\x9E][\x00-\xFF])|(\x9F[\x00-\xA5])/ # (untested)
That will be improved in Ruby 1.9, though.
Edit: As noted in the comments, the Unicode characters \u4E00-\u9FA5 only map to the expression above in the UTF-16BE encoding. The UTF-8 encoding is likely different, so you need to analyze the mapping carefully and see if you can come up with a byte-matching expression for Ruby 1.8.
This is what I have done:
%r{^[#{"\344\270\200"}-#{"\351\277\277"}]+$}
This is basically a regular expression with the octal values that represent the range between U+4E00 and U+9FFF, the most common Chinese and Japanese characters.
The Oniguruma regexp engine has proper support for Unicode. Ruby 1.9 uses Oniguruma by default. Ruby 1.8 can be recompiled to use it.
With Oniguruma you can use the exact same regex as in PHP, including the /u modifier to force Ruby to treat the string as UTF-8.
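A sketch on Ruby 1.9+ (assuming the input arrives as UTF-8):
CJK_RE = /\A[\u4e00-\u9fa5]+\z/
"可口可樂" =~ CJK_RE  # => 0 (matches)
"qwerty" =~ CJK_RE    # => nil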
ActiveSupport has a UTF-8 handler:
http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Handlers/UTF8Handler.html
Otherwise, look at Ruby 1.9's encoding method for Regexp objects.