Ruby Strings, get list of compatible encodings - ruby-on-rails

How can I get a list of compatible encodings for a Ruby String? (MRI 1.9.3)
Use case: I have some user-provided strings, encoded with UTF-8. Ideally I need to convert them to ISO/IEC 8859-1 (8-bit), but I also need to fall back to Unicode when special characters are present.
Also, is there a better way to accomplish this? Maybe I am testing the wrong thing.
EDIT- adding more details
Thanks for the answers; I should probably add some context.
I know how to perform encoding conversion.
I'm looking for a way to quickly find out if a string can be safely encoded to another encoding or, to put it in another (and quite wrong) way, what is the minimum encoding to support all the characters in that string.
Just converting the strings to a 16-bit encoding is not an option, because they will be sent as SMS messages, and converting them to a 16-bit encoding cuts the number of available characters from 160 down to 70.
I need to convert them to 16-bit only when they contain a special character which is not supported in ISO/IEC 8859-1.

Unluckily, Ruby’s ideas of encoding compatibility are not fully congruent with your use case. However, trying to encode your UTF-8 string in ISO-8859-1 and catching the error that is thrown when a conversion is not possible will achieve what you are after:
begin
  # transcodes in place; raises if a character has no ISO-8859-1 mapping
  'your UTF-8 string'.encode!('ISO-8859-1')
rescue Encoding::UndefinedConversionError
  # leave the string as UTF-8
end
will convert your string to ISO-8859-1 if possible and leave it as UTF-8 if not.
Note this uses encode, which actually transcodes the string using Encoding::Converter (i.e. reassigns the correct encoding byte pattern to the character representations of the string), unlike force_encoding, which just changes the encoding flag (i.e. tells Ruby to interpret the string’s byte stream according to the set encoding).
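For the SMS use case this pattern can be wrapped in a small predicate; a minimal sketch, with encodable_as? as a hypothetical helper name:
# True if str can be fully represented in the target encoding.
def encodable_as?(str, encoding)
  str.encode(encoding)
  true
rescue Encoding::UndefinedConversionError
  false
end

encodable_as?("plain text", "ISO-8859-1") # => true
encodable_as?("日本語", "ISO-8859-1")     # => false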

Ruby's standard library includes the Encoding class and, nested inside it, Encoding::Converter; they are probably your best friends in this case.
#!/usr/bin/env ruby
# encoding: utf-8
converter = Encoding::Converter.new("UTF-8", "ISO-8859-1")
converted = converter.convert("é")
puts converted.encoding
# => ISO-8859-1
puts converted.dump
# => "\xE9"

Is valid_encoding? (instance method of String) useful? That is:
try_str = str.dup.force_encoding("ISO-8859-1") # dup first: force_encoding changes str in place
str = try_str if try_str.valid_encoding?

To treat a string as ISO-8859-1, you can force that encoding and then transcode, as below:
1.9.3p194 :002 > puts "é".force_encoding("ISO-8859-1").encode("UTF-8")
é
=> nil
Linked Answer

"Some String".force_encoding("ISO/IEC 8859-1")
Also you can refer rails encoding link

Related

CGI::unescape can't handle unescaping "wymiana+teflon%F3w"?

I am working on data imported from a legacy database into SQLite for development. The legacy database has a lot of URL-encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (that I've noticed so far; there may be more, as I haven't done any real testing yet): the letter "ó". For instance, using unescapeHTML on the string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
The question now is whether my string is properly escaped, as other Polish characters use sequences of "&#nnn;", like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow the standard for numeric character references. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C", which are properly handled by CGI::unescapeHTML. However, %5C represents a backslash, an ASCII character, not a high byte like %F3. Can that be the reason? I tried to research this but haven't found any explanation. I also updated my Ruby to 2.1.0, as CGI::Util has changed in the new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1), but '\xF3' is not a valid UTF-8 string; that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of the modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-2 and what should be UTF-8. You might have a mix of ISO-8859-2 and other encodings (-1, -3, ..., Windows code pages, ...) in your data; unfortunately, there's no reliable way to tell the difference, as the encodings overlap, and there's no way to be sure which result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is (sketched in code below):
1. Send everything through your CGI::unescape_html(CGI::unescape(...)) converter.
2. Wrap that in an exception handler to trap the inevitable problems.
3. Stash the problem strings off to the side somewhere.
4. Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they make sense.
5. Repeat with other common encodings until there's nothing left that you care about.
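A minimal sketch of steps 1-4, assuming legacy_strings is an array of the raw imported values (hypothetical name):
require 'cgi'

cleaned, problems = [], []
legacy_strings.each do |raw|
  begin
    cleaned << CGI::unescape_html(CGI::unescape(raw)) # steps 1-2
  rescue ArgumentError # e.g. invalid byte sequence in UTF-8
    problems << raw # step 3: stash for a second pass
  end
end

# step 4: retry the problem strings as ISO-8859-2 bytes and eye-ball the output
problems.each do |raw|
  puts CGI::unescape(raw).force_encoding('ISO-8859-2').encode('UTF-8')
end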
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
I've chosen another way to solve my problem: simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, the capital letter Ó also needs a similar substitution. The actual code I used:
def unescape_ó(s)
  # map %D3/%F3 (Ó/ó) to their escaped hex character references &#xD3;/&#xF3;
  s.gsub(/%D3|%F3/, '%D3' => '%26%23xD3%3B', '%F3' => '%26%23xF3%3B')
end
With this approach I don't have to handle the invalid byte sequence exception, as a properly escaped string is passed to CGI::unescapeHTML.

Rails oracle raw16

I'm using Rails 3.2.1 and I have been stuck on a problem for quite a while.
I'm using the Oracle enhanced adapter, and I have a RAW(16) (uuid) column; when I try to display the data there are two situations:
1) I see weird symbols
2) I get "incompatible character encodings: ASCII-8BIT and UTF-8"
In my application.rb file I added the
config.encoding = 'utf-8'
and in my view file I added
'#encoding=utf-8'
But so far nothing has worked.
I also tried to add html_safe but it failed.
How can I safely display my uuid data?
Thank you very much
Answer:
I used the unpack method to convert the binary with the parameters H8H4H4H4H12 and in the end joined the array :-)
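That is, something like the following sketch (the raw value here is a made-up example):
# raw is the 16-byte binary string read from the RAW(16) column
raw = "\x55\x0e\x84\x00\xe2\x9b\x41\xd4\xa7\x16\x44\x66\x55\x44\x00\x00"

uuid = raw.unpack("H8H4H4H4H12").join("-")
# => "550e8400-e29b-41d4-a716-446655440000"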
The RAW datatype is a string of bytes that can take any value. This includes binary data that doesn't translate to anything meaningful in ASCII or UTF-8 or in any character set.
You should really read Joel Spolsky's note about character sets and unicode before continuing.
Now, since the data can't be translated reliably to a string, how can we display it? Usually we convert or encode it, for instance:
you could use the hexadecimal representation, where each byte is converted to two [0-9A-F] characters (in Oracle, using the RAWTOHEX function; see the sketch after this list). This is fine for displaying a small binary field such as RAW(16).
you could also use other encodings such as Base64, in Oracle with the UTL_ENCODE package.
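A sketch of the RAWTOHEX route from the Rails side, with Widget standing in for the actual model (hypothetical name):
# Let Oracle do the hex conversion in the SELECT itself.
rows = Widget.select("RAWTOHEX(uuid) AS uuid_hex, name")
rows.first.uuid_hex # => "550E8400E29B41D4A716446655440000"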

Eliminating non-convertible characters on encoding change from UTF-8 to Shift_JIS with Ruby 1.9

I need to write a CSV export program which internally uses UTF-8 encoding, originating from user input via the web (so you can expect any characters). It's a Japanese system, so I need to encode to Shift_JIS.
Now, when I change UTF-8 into Shift_JIS, I get errors like:
Encoding::UndefinedConversionError (U+7E6B from UTF-8 to Shift_JIS):
I want to either a) eliminate the character, or b) map the character to some other character
(or simply, to the string '(U+7E6B)').
It seems I could catch the exception and eliminate it as a byte string, but there must be an easier way to do this.
What is the best way to do this conversion?
[Converting my follow-up comments to the question into an answer]
I found that encode takes options; calling it with
:undef => :replace, :replace => "?" # for Encoding::UndefinedConversionError
has the desired effect. You can also specify:
:invalid => :replace # for Encoding::InvalidByteSequenceError
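A sketch of both variants, using U+7E6B (the character from the error message above):
str = "\u7E6B" # 繫, which has no Shift_JIS mapping

# a) replace unconvertible characters with '?'
str.encode("Shift_JIS", :undef => :replace, :invalid => :replace, :replace => "?")
# => "?"

# b) map each unconvertible character to a marker string like "(U+7E6B)"
str.encode("Shift_JIS", :fallback => lambda { |char| "(U+%04X)" % char.ord })
# => "(U+7E6B)"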

Detecting non-ASCII characters in Rails

I am wondering if there's a way to detect non-ASCII characters in Rails.
I have read that Rails does not use Unicode by default, and characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these characters in Rails? or just specify the range of characters I am expecting?
Is there a plugin for this? Thanks in advance!
All ideographic language encodings use multiple bytes to represent a character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).
You may compare the character length to the byte length of the string as a quick and dirty detector. It is probably not foolproof though.
class String
  def multibyte?
    chars.count < bytes.count
  end
end
"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false
This is pretty easy with 1.9.2, as regular expressions are character-based in 1.9.2, which knows the difference between bytes and characters top to bottom. You're in Rails, so you should get everything in UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range, so you can just remove everything that isn't between ' ' and '~' when you have UTF-8-encoded text:
>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"
There's really no reason to go to all this trouble though. Ruby 1.9 works great with Unicode, as do Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago; now it is common and fairly straightforward.
If you do manage to get text data that isn't UTF-8 then you have some options. If the encoding is ASCII-8BIT or BINARY then you can probably get away with s.force_encoding('utf-8'). If you end up with something other than UTF-8 and ASCII-8BIT then you can use Iconv to re-encode it.
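For instance, a sketch of the ASCII-8BIT case (the example bytes are valid UTF-8 that merely arrived mislabeled):
s = "caf\xC3\xA9".force_encoding("ASCII-8BIT") # mislabeled input
s.force_encoding("UTF-8") # reinterpret the same bytes as UTF-8
s.valid_encoding? # => true
s # => "café"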
References:
Encoding
Iconv
String#force_encoding

Why does Rails 3 think \xE2\x80\x89 means â\x80\x89

I have a field scraped from a utf-8 page:
"O’Reilly"
And saved in a yml file:
:name: "O\xE2\x80\x99Reilly"
(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe)
However when I load the value into a hash and yield it to a page tagged as utf-8, I get:
OâReilly
I looked up the character â, which is encoded in UTF-16 as 0x00E2, and the characters \x80 and \x89 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8 character.
How do I make rails interpret a 3-byte UTF-8 code as a single character?
Ruby strings are sequences of bytes instead of characters:
$ irb
>> "O\xE2\x80\x99Reilly"
=> "O\342\200\231Reilly"
Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to make sure you output the correct string in HTML (I assume you want HTML since you mentioned Rails) is to convert non-printable characters to HTML entities; in your case to
O&#8217;Reilly
This takes some work, but it should help in cases where you send your HTML as UTF-8 but your end-user has set his or her browser to override that and show Latin-1 or some other silly restricted charset.
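A minimal 1.9 sketch of that entity conversion (to_entities is a hypothetical helper name):
# Replace every non-ASCII character with its decimal numeric character reference.
def to_entities(s)
  s.gsub(/[^[:ascii:]]/) { |char| "&##{char.ord};" }
end

to_entities("O\u2019Reilly") # => "O&#8217;Reilly"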
Ultimately this was caused by loading a syck-generated file (created by an external script) with psych (in Rails). Loading with syck solved the issue:
# in plain Ruby
puts YAML::ENGINE.yamler # => syck
# in Rails
puts YAML::ENGINE.yamler # => psych

# in the web app: switch to syck, load, then switch back
YAML::ENGINE.yamler = 'syck'
a = YAML::load(file_saved_with_syck)
a[index][:name] # => "O’Reilly"
YAML::ENGINE.yamler = 'psych'
I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.
It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.
The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2 and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings with the associated ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later.
If this is the case, you might have to force_encoding the read strings back to what they should have been, or set default_internal to cause the strings to be read back into UTF-8. Bit of a mess, this.
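A sketch of both fixes (the hash access mirrors the question; the names are assumptions):
# re-tag a string that was read as ASCII-8BIT but really holds UTF-8 bytes
name = a[index][:name]
name.force_encoding("UTF-8") if name.encoding == Encoding::ASCII_8BIT

# or, set this before loading the YAML so external data is transcoded to UTF-8
Encoding.default_internal = Encoding::UTF_8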
