Rails: encoding woes with serialized hashes despite UTF8 - ruby-on-rails

I've just updated from ruby 1.9.2 to ruby 1.9.3p0 (2011-10-30 revision 33570). My rails application uses postgresql as its database backend. The system locale is UTF8, as is the database encoding. The default encoding of the rails application is also UTF8. I have Chinese users who input Chinese characters as well as English characters. The strings are stored as UTF8 encoded strings.
Rails version: 3.0.9
Since the update some of the existing Chinese strings in the database are no longer displayed correctly. This does not affect all strings, but only those that are part of a serialized hash. All other strings that are stored as plain strings still appear to be correct.
Example:
This is a serialized hash that is stored as a UTF8 string in the database:
broken = "--- !map:ActiveSupport::HashWithIndifferentAccess \ncheckbox: \"1\"\nchoice: \"Round Paper Clips \\xEF\\xBC\\x88\\xE5\\x9B\\x9E\\xE5\\xBD\\xA2\\xE9\\x92\\x88\\xEF\\xBC\\x89\\r\\n\"\ninfo: \"10\\xE7\\x9B\\x92\"\n"
In order to convert this string to a ruby hash, I deserialize it with YAML.load:
broken_hash = YAML.load(broken)
This returns a hash with garbled contents:
{"checkbox"=>"1", "choice"=>"Round Paper Clips ï¼\u0088å\u009B\u009Eå½¢é\u0092\u0088ï¼\u0089\r\n", "info"=>"10ç\u009B\u0092"}
The garbled stuff is supposed to be UTF8-encoded Chinese. broken_hash['info'].encoding tells me that ruby thinks this is #<Encoding:UTF-8>. I disagree.
Interestingly, all other strings that were not serialized look fine. In the same record, a different field contains Chinese characters that look just right in the rails console, the psql console, and the browser. Every string, whether part of a serialized hash or a plain string, saved to the database since the update looks fine, too.
I tried to convert the garbled text from a possibly wrong encoding (like GB2312 or ANSI) to UTF-8, despite ruby's claim that it was already UTF-8, and of course I failed. This is the code I used:
require 'iconv'
Iconv.conv('UTF-8', 'GB2312', broken_hash['info'])
This fails because ruby doesn't know what to do with illegal sequences in the string.
I really just want to run a script to fix all the old, presumably broken serialized hash strings and be done with it. Is there a way to convert these broken strings to something resembling Chinese again?
I just played with the escaped UTF-8 bytes inside the raw string (called "broken" in the example above). This is the Chinese string that is encoded in the serialized string:
chinese = "\\xEF\\xBC\\x88\\xE5\\x9B\\x9E\\xE5\\xBD\\xA2\\xE9\\x92\\x88\\xEF\\xBC\\x89\\r\\n"
I noticed that it is easy to convert this to a real UTF-8 encoded string by unescaping it (removing the escape backslashes).
chinese_ok = "\xEF\xBC\x88\xE5\x9B\x9E\xE5\xBD\xA2\xE9\x92\x88\xEF\xBC\x89\r\n"
This returns a proper UTF-8-encoded Chinese string: "(回形针)\r\n"
The thing falls apart only when I use YAML.load(...) to convert the string to a ruby hash. Maybe I should process the raw string before it is fed to YAML.load. Just makes me wonder why this is so...
Interesting! This is likely due to the YAML engine "psych" that's used by default now in 1.9.3. I switched to the "syck" engine with YAML::ENGINE.yamler = 'syck' and the broken strings are correctly parsed.

This seems to have been caused by a difference in the behaviour of the two available YAML engines "syck" and "psych".
To set the YAML engine to syck:
YAML::ENGINE.yamler = 'syck'
To set the YAML engine back to psych:
YAML::ENGINE.yamler = 'psych'
The "syck" engine processes the strings as expected and converts them to hashes with proper Chinese strings. When the "psych" engine is used (default in ruby 1.9.3), the conversion results in garbled strings.
Adding the above line (the first of the two) to config/application.rb fixes this problem. The "syck" engine is no longer maintained, so I should probably only use this workaround to buy me some time to make the strings acceptable for "psych".
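Putting it together, a minimal sketch of the workaround applied to the broken string from above (broken is the same variable as in the example):
require 'yaml'

YAML::ENGINE.yamler = 'syck'    # parse the legacy string with the old engine
fixed_hash = YAML.load(broken)  # broken is the serialized hash string from above
fixed_hash['info']              # => "10盒", proper UTF-8 again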

From the 1.9.3 NEWS file:
* yaml
* The default YAML engine is now Psych. You may downgrade to syck by setting
YAML::ENGINE.yamler = 'syck'.
Apparently the Syck and Psych YAML engines treat non-ASCII strings in different and incompatible ways.
Given a Hash like you have:
h = {
  "checkbox" => "1",
  "choice" => "Round Paper Clips (回形针)\r\n",
  "info" => "10盒"
}
Using the old Syck engine:
>> YAML::ENGINE.yamler = 'syck'
>> h.to_yaml
=> "--- \ncheckbox: "1"\nchoice: "Round Paper Clips \\xEF\\xBC\\x88\\xE5\\x9B\\x9E\\xE5\\xBD\\xA2\\xE9\\x92\\x88\\xEF\\xBC\\x89\\r\\n"\ninfo: "10\\xE7\\x9B\\x92"\n"
we get the ugly double-backslash format that you currently have in your database. Switching to Psych:
>> YAML::ENGINE.yamler = 'psych'
=> "psych"
>> h.to_yaml
=> "---\ncheckbox: '1'\nchoice: ! "Round Paper Clips (回形针)\\r\\n"\ninfo: 10盒\n"
The strings stay in normal UTF-8 format. If we manually screw up the encoding to be Latin-1:
>> Iconv.conv('UTF-8', 'ISO-8859-1', "\xEF\xBC\x88\xE5\x9B\x9E\xE5\xBD\xA2\xE9\x92\x88\xEF\xBC\x89")
=> "ï¼\u0088å\u009B\u009Eå½¢é\u0092\u0088ï¼\u0089"
then we get the sort of nonsense that you're seeing.
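For what it's worth, when a string really has been mangled by exactly this accidental Latin-1 round trip, reversing it is just the opposite re-encoding (a sketch, not a guaranteed general fix):
garbled = "10ç\u009B\u0092"                            # the mojibake from above
garbled.encode('ISO-8859-1').force_encoding('UTF-8')   # => "10盒"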
The YAML documentation is rather thin so I don't know if you can force Psych to understand the old Syck format. I think you have three options:
Use the old, unsupported and deprecated Syck engine; you'd need to set YAML::ENGINE.yamler = 'syck' before you load or dump any YAML.
Load and decode all your YAML using Syck and then re-encode and save it using Psych (see the sketch after this list).
Stop using serialize in favor of manually serializing/deserializing using JSON (or some other stable, predictable, and portable text format) or use an association table so that you're not storing serialized data at all.
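A minimal sketch of option 2, assuming a model named Order with a serialized column named prefs (both names are hypothetical), run once from a console or a one-off script:
YAML::ENGINE.yamler = 'syck'    # read the legacy records with the old engine
decoded = Order.all.map { |o| [o.id, o.prefs] }

YAML::ENGINE.yamler = 'psych'   # save them again so Psych-style YAML ends up in the database
decoded.each { |id, prefs| Order.find(id).update_attribute(:prefs, prefs) }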

Related

Rails 5 - Problem with Umlaute and parameterize method

I work on Rails 5.2.3 and Ruby 2.5.1. At some point, I found an issue where I expected my array-of-strings constant to contain a certain string but it didn't. It turned out the problem was related to German umlaut characters (öäü).
So I have the constant defined like the following:
# coding: utf-8
# frozen_string_literal: true

class MyClass
  module MyModule
    MY_CONSTANT = [
      'Breite in mm',
      'Höhe in mm',
      'Länge in mm'
    ].map(&:parameterize).freeze
  end
end
I expect the constant to look like ["breite-in-mm", "hoehe-in-mm", "laenge-in-mm"]
But instead it's stored as ["breite-in-mm", "hohe-in-mm", "lange-in-mm"]. You see, "ö" has been converted to "o" instead of "oe". Same for "ä". Now it's "a", not "ae".
I get the wrong result in production, in RSpec tests, and even when I start a Rails console and call this constant. But when I define a new constant from the Rails console using the very same code, the strings are converted to what I expect, i.e. ["breite-in-mm", "hoehe-in-mm", "laenge-in-mm"]
I could easily get rid of this parameterize method and just type in the strings as I need them. Maybe I will have to do that. But I'm really curious about why all this is happening and couldn't find an answer by myself.
So thank you in advance for any ideas.
The parameterize method in Rails (through its use of ActiveSupport::Inflector#transliterate) is locale aware. It thus uses locale-dependent rules to transliterate characters such as umlauts to ASCII characters.
When your app handles a request (or at least once after booting), you usually set an I18n locale, e.g. with I18n.locale = :de for a single request or with I18n.default_locale = :de for your whole app. After that, Rails (or rather the i18n gem) uses this locale by default for its transliteration rules.
When initially setting your constant, this default locale was likely not yet set. The i18n gem is thus not aware of the German transliteration rules and uses only the basic Unicode normalization rules.
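A quick illustration of the difference, assuming German transliteration rules (an i18n.transliterate.rule entry for :de) are loaded, as they apparently are in your app:
I18n.locale = :en
'Höhe in mm'.parameterize  # => "hohe-in-mm"  (only the built-in Unicode approximations)
I18n.locale = :de
'Höhe in mm'.parameterize  # => "hoehe-in-mm" (German rule: ö becomes oe)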
As a workaround, you can either pass the desired locale to the parameterize method as
MY_CONSTANT = [
  'Breite in mm',
  'Höhe in mm',
  'Länge in mm'
].map { |const| const.parameterize(locale: :de).freeze }.freeze
or alternatively set the default i18n locale before your code is executed (e.g. in a file in config/initializers, depending on where exactly you initialize your constant):
I18n.default_locale = :de
Thanks, Holger Just, for your great answer. It seems to be correct, except that it only works for Rails 6.0.0. So I'm going to post the one for Rails 5.2.3, which I'm using in my project.
Unfortunately, in Rails 5 the parameterize method does not accept a locale argument yet; that is only possible from Rails 6 on.
But still, as mentioned in Holger Just's answer, parameterize relies on transliterate, which does use the current locale and converts strings according to it.
See Rails 5.2.3 docs and sources for those methods:
https://api.rubyonrails.org/v5.2.3/classes/ActiveSupport/Inflector.html#method-i-parameterize
https://api.rubyonrails.org/v5.2.3/classes/ActiveSupport/Inflector.html#method-i-transliterate
So I cannot pass the locale to parameterize directly; instead, I have to set the locale before my constant is defined.
Setting I18n.default_locale = :de in application.rb did not help. I already had that, and the strings were still transliterated without the German rules.
What eventually helped was setting I18n.locale = :de manually. Thanks to this, I got my strings parameterized correctly without any changes to the MY_CONSTANT definition.
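For Rails 5, a sketch of scoping that workaround to just the constant definition with I18n.with_locale, so the rest of the boot process keeps its own locale (this again assumes the German transliteration rules are available, as they apparently are in this app):
MY_CONSTANT = I18n.with_locale(:de) do
  [
    'Breite in mm',
    'Höhe in mm',
    'Länge in mm'
  ].map(&:parameterize)
end.freeze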

Fixtures: How to load utf-8 characters and convert to non utf-8 and save into database?

I am new to Rails.
I am working on legacy code that relies on Fixtures to load non-UTF-8 characters from yml files into the database.
With the latest Rails version, the yml files must be UTF-8 encoded, meaning that what Fixtures reads in will be UTF-8 encoded.
How can I convert the following code so that it converts the data from UTF-8 to non-UTF-8 and loads it into the (non-UTF-8 encoded, of course) database?
class LoadUsersData < ActiveRecord::Migration
  def self.up
    down
    directory = File.join(File.dirname(__FILE__))
    Fixtures.create_fixtures(directory, "users")
  end

  def self.down
    User.delete_all
  end
end
where users.yml contains data in UTF-8 format (it has to).
Or is it possible to use a .txt file (which can be in a non-UTF-8 format) and load it with Fixtures?
I am using oracle_enhanced, and I tried changing "encoding" to my database encoding in database.yml, but it seems the conversion from UTF-8 to my encoding is not done automatically (the string length differs between the two encodings, with UTF-8 using more bytes, so the database returns a "user_name is too long" error).
I solved this by
adding the following line to boot.rb:
ENV['NLS_LANG'] ||= 'AMERICAN_AMERICA.JA16SJIS'
converting all yml files to UTF-8 format. (Note that with SJIS-formatted yml, the ruby gem syck will load the yml, but fixtures won't work even though the yml is loaded.)
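If you want to sanity-check a single value, String#encode is enough to see whether a UTF-8 string survives the trip into the database encoding; the value and target encoding below are only examples:
value = "日本語"                    # UTF-8 string as read from the yml
sjis  = value.encode('Shift_JIS')   # raises Encoding::UndefinedConversionError if a character has no SJIS mapping
sjis.bytesize                       # length in the target encoding, which is what byte-based column limits count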

Rails encoding in ASCII-8BIT

I know this has been asked several times, but something strange is happening to me:
I have an index view where rendering certain characters (accented letters) causes Rails to raise the exception
incompatible character encodings: ASCII-8BIT and UTF-8
so I checked my strings' encoding, and it is actually ASCII-8BIT everywhere, even though I set the proper encoding to UTF-8 in my application.rb
config.encoding = "utf-8"
and in my environment.rb
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
and my database reports:
character_set_database = utf-8
as suggested in some guides.
Strings are inserted with a textarea field and are not concatenated to any other already inserted string.
The strange things are:
this happens only in the index view, not in the show view (same resource)
this happens only for this model (which is an email, with subject and body, but this shouldn't affect anything)
In my development environment everything works when I set str.force_encoding('utf-8'), whereas in production this does not work (in development I'm on Ruby 2.0.0, in production on Ruby 2.1.0; both are Rails 4 and both use MySQL)
adding # encoding: utf-8 to the view file also doesn't work
trying str.force_encoding('ascii-8bit').encode('utf-8') raises Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8, where the "\xC3" is part of an à; body.force_encoding('ascii-8bit').encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?') replaces all accented characters with a ?; and str.force_encoding('iso-8859-1').encode('utf-8') obviously generates the wrong character (a ?).
So I have two questions:
- why is rails setting the string encoding to ascii-8bit?
- how to solve this issue?
I've already checked these questions (the newest ones with rails4):
Rails View Encoding Issues
"\xC2" to UTF-8 in conversion from ASCII-8BIT to UTF-8
How to convert a string to UTF8 in Ruby
Encoding::UndefinedConversionError: "\xE4" from ASCII-8BIT to UTF-8
and other resources also, but nothing worked.
You probably have a string literal in your source code somewhere that you then concatenate with another string. For instance:
some_string = "this is a string"
or even
some_string = "" #empty string
Those strings, stored in some_string, will be marked ASCII-8BIT, and if you then later do something like:
some_string = some_string + unicode_string
Then you'll get the error. That is, those strings will be marked ASCII-8BIT unless you add, to the top of the file where the string literals are created:
#encoding: utf-8
That declaration determines the default encoding that string literals in source code will have.
I am just guessing, because this pattern is a common source of this problem. To know more for sure, it would take more information than is in your question -- it would take debugging the actual source code, to figure out exactly what string is tagged as ASCII-8BIT when you expect it to be tagged UTF-8 instead, and exactly where that String came from.
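For illustration, a minimal sketch of the failure mode and of the re-tagging fix; the binary string below stands in for whatever your app actually receives tagged as ASCII-8BIT:
binary = "\xC3\xA0".force_encoding('ASCII-8BIT')  # the bytes of "à", tagged as binary
utf8   = "è"                                      # an ordinary UTF-8 literal
binary + utf8                                     # => Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
binary.force_encoding('UTF-8') + utf8             # => "àè" (re-tagging works because the bytes really are UTF-8)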

Avoid using # encoding: UTF-8

I ran into a problem with a Rails controller where it choked on a Unicode string:
syntax error, unexpected $end, expecting ']'
...conditions => ['url like ?', "%日本%"])
The solution to this problem was to set the encoding at the top of the controller file using
# encoding: UTF-8
Is there any way to set this globally? I keep on getting into trouble by forgetting to set it in files. Alternatively, is there a default somewhere that will make sure that all strings are thought of as Unicode? Are there any problems with setting everything to be Unicode?
In less than a month, Ruby 2.0 will be released, which will have UTF-8 as the default encoding. Then, you will not need to do that any more.
You can try setting the environment variable RUBYOPT to the value -Ku:
export RUBYOPT="-Ku"

RoR character class regex

I have the following line of code in my Ruby on Rails app, which checks whether the given string contains Korean characters or not:
isKorean = !/\p{Hangul}/.match(word).nil?
It works perfectly in the console, but raises a syntax error for the actual app:
invalid character property name {Hangul}: /\p{Hangul}/
What am I missing and how can I get it to work?
This is a character encoding issue; you need to add:
# encoding: utf-8
to the top of the Ruby file you're using that regex in. You can probably use any encoding in which the character class you're using exists, instead of UTF-8, if you wish. Note that in Ruby 2.0, UTF-8 is now the default, so this is not needed in Ruby 2.0+.
This is known as a "magic comment". You can and should read more about encoding in Ruby 1.9. Note that encoding in Rails views is handled automatically by config.encoding (set to UTF-8 by default in config/application.rb).
It was likely working in the console because your terminal is set to use UTF-8 already.
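A minimal sketch of the fixed file (the helper name korean? is made up for illustration):
# encoding: utf-8
def korean?(word)
  !/\p{Hangul}/.match(word).nil?
end

korean?("안녕하세요")  # => true
korean?("hello")       # => false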
