Ruby 2.1.5 - ArgumentError: invalid byte sequence in UTF-8 - ruby-on-rails

I'm having trouble with UTF8 chars in Ruby 2.1.5 and Rails 4.
The problem is that the data coming from an external service look like this:
"first_name"=>"ezgi \xE7enberci"
"last_name" => "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"
These are mostly Turkish alphabet characters like "üğşiçö". When the application tries to save this data, the errors below occur:
ArgumentError: invalid byte sequence in UTF-8
Mysql2::Error: Incorrect string value
How can I fix this?

What's Wrong
Ruby thinks you have invalid byte sequences because your strings aren't UTF-8. For example, using the rchardet gem:
require 'rchardet'
["ezgi \xE7enberci", "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"].map do |str|
  CharDet.detect(str)
end
#=> [{"encoding"=>"ISO-8859-2", "confidence"=>0.8600826867857209},
#    {"encoding"=>"windows-1255", "confidence"=>0.5807177322740268}]
How to Fix It
You need to use String#scrub or one of the encoding methods like String#encode! to clean up your strings first. For example:
hash = {"first_name"=>"ezgi \xE7enberci",
"last_name"=>"\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"}
hash.each_pair { |k, v| v.encode!("UTF-8", "ISO-8859-2") }
#=> {"first_name"=>"ezgi çenberci", "last_name"=>"üţçđiţţöç"}
Obviously, you may need to experiment a bit to figure out what the proper encoding is (e.g. ISO-8859-2, windows-1255, or something else entirely), but ensuring a consistent encoding across your data set is going to be critical.
Character encoding detection is imperfect. Your best bet will be to try to find out what encoding your external data source is using, and use that in your string encoding rather than trying to detect it automatically. Otherwise, your mileage may vary.
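The answer mentions String#scrub but doesn't show it; here is a minimal sketch of how it behaves when you can't identify the source encoding:

```ruby
# scrub replaces byte sequences that are invalid in the string's declared encoding.
dirty = "ezgi \xE7enberci"   # \xE7 is not valid UTF-8 here
dirty.valid_encoding?        # => false
dirty.scrub                  # => "ezgi \uFFFDenberci" (Unicode replacement character)
dirty.scrub("")              # => "ezgi enberci" (invalid bytes dropped)
```

Note that scrub throws information away, so prefer a real transcode via encode whenever you can identify the source encoding.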

That doesn't look like UTF-8 data, so this exception is normal. It sounds like you need to tell Ruby what encoding the string is actually in:
some_string.force_encoding("windows-1254")
You can then convert it to UTF-8 with the encode method. There are gems (e.g. charlock_holmes) with heuristics for auto-detecting encodings if you're receiving a mix of them.
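A minimal sketch of that two-step dance (windows-1254 is a guess for Turkish data; substitute whatever the external service actually sends):

```ruby
raw = "ezgi \xE7enberci".b                    # raw bytes off the wire (ASCII-8BIT)
labeled = raw.force_encoding("windows-1254")  # relabel only: bytes unchanged
utf8 = labeled.encode("UTF-8")                # transcode: bytes rewritten
utf8  # => "ezgi çenberci"
```

force_encoding just tags the string with an encoding; encode actually rewrites the bytes into the target encoding.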

Related

Parse Hash in Ruby

How can I parse a hash in ROR?
I have a hash in string format (enclosed by double quotes) and I need to parse it into a valid hash.
eg.
input_hash = '{"name" => "john"}'
Desired output:
output_hash = {"name" => "john"}
This is the wrong approach. The string representation of a Ruby hash is not a good way to serialise data. It is well structured, and it's definitely possible to get it back into a Ruby hash (via eval), but doing so is extremely dangerous: it gives an attacker who controls the input string full control over your system.
Approach the problem from a different angle. Look for where the string gets stored and change the code there instead. Store it for example as JSON. Then it can easily and safely be parsed back to a hash and can also be sent to systems running on something that is not ruby.
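A minimal sketch of the JSON round trip suggested above:

```ruby
require 'json'

hash = { "name" => "john" }
serialized = hash.to_json         # store this string instead of hash.to_s
restored = JSON.parse(serialized) # => {"name"=>"john"}
```

Unlike eval, JSON.parse only ever builds plain data structures, so hostile input can't execute code.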

Unable to convert url params string to hash and then back to url string

Input is a params string (Input format cannot be changed)
something=1,2,3,4,5&something_else=6,7,8
Expected Output:
something=1,2,3,4,5&something_else=6,7,8
What I am doing:
params = 'something=1,2,3,4,5'
CGI::parse(params)
CGI.unescape(CGI.parse(params).to_query)
And I am getting this as output:
something[]=1,2,3,4,5
When I run CGI::parse(params) I get {"something"=>["1,2,3,4,5"]}, which is wrong: something is the single string "1,2,3,4,5", but CGI.parse wraps it in an array.
The reason I need to parse the params is that I need to manipulate them.
Is there any other possible way to convert it correctly and maintain the params format?
The CGI module is a complete dinosaur and should probably be thrown in the garbage because of how bad it is, but for some reason it persists in the Ruby core. Maybe some day someone will refactor it and make it workable. Until then, skip it and use something better like URI, which is also built-in.
Given your irregular, non-compliant query string:
query_string = 'something=1,2,3,4,5&something_else=6,7,8'
You can handle this by using the decode_www_form method which handles query-strings:
require 'uri'
decoded = URI.decode_www_form(query_string).to_h
# => {"something"=>"1,2,3,4,5", "something_else"=>"6,7,8"}
To re-encode it, call encode_www_form and then unescape to undo the percent-encoding it (correctly) applies to the , values:
encoded = URI.unescape(URI.encode_www_form(decoded))
# => "something=1,2,3,4,5&something_else=6,7,8"
That should get the effect you want.

Mongoid can not query non-latin attributes

My Mongoid document has two attributes, :en_name and :ru_name. I have created one record:
MyModel.create(en_name: 'sport', ru_name: 'спорт')
Then I query it:
MyModel.where(en_name: 'sport').first
It returns my model.
When I try to query this:
MyModel.where(ru_name: 'спорт').first
it returns nil.
How can I make Mongoid query attributes that are non-Latin?
MongoDB uses UTF-8. However, you may experience problems if the server is running on Windows, because Windows (for Cyrillic locales) uses CP1251.
Use Robomongo (a cross-platform graphical client) to verify that the data was written to the database in the correct encoding.
BSON strings can only be encoded in UTF-8. If the data is not displayed correctly, you are probably not converting it to UTF-8 before writing it to MongoDB.
Check the encoding: encoding_name = str.encoding.name
Convert the encoding: utf_str = Iconv.conv('utf-8', 'windows-1251', str)
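Note that Iconv is deprecated and no longer ships with recent Rubies; String#encode does the same job. A minimal sketch:

```ruby
# Simulate a string that arrived in CP1251 (the Windows Cyrillic codepage).
cp1251 = "спорт".encode("windows-1251")
cp1251.encoding.name   # => "Windows-1251"

utf8 = cp1251.encode("UTF-8")
utf8                   # => "спорт"
```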

What is the best way to replace smart quotes, smart apostrophe and ellipsis in Rails 3?

My application allows the user to enter text. When they copy and paste from MS Word, it pastes smart quotes, smart apostrophes and ellipsis. These characters get saved into the database and cause problems. What is the best way to replace these non-UTF-8 characters with normal quotes("), apostrophe(') and periods(...)?
Also, how do you test this functionality? I added a test with these special characters and # encoding: ISO-8859-1 at the top of the file. The special characters caused the tests to stop running: /home/george/.rvm/gems/ruby-1.9.2-p180/gems/redgreen-1.2.2/lib/redgreen.rb:62:in 'sub': invalid byte sequence in UTF-8 (ArgumentError)... Apparently the redgreen gem is incompatible with these characters?
Thanks.
You can add a before_save callback that converts your text to UTF-8. If you have just one field that might contain non-UTF-8 chars, it's simple; if you have many fields, it's better to dynamically iterate over the changed text/string fields and fix them. Either way you need String#encode. Here is an example:
before_save :fix_utf8_encoding

def fix_utf8_encoding
  columns = self.class.columns.select { |col| [:text, :string].include?(col.type) }.map { |col| col.name.to_sym }
  columns.each do |col|
    self[col] = self[col].encode('UTF-8', :invalid => :replace, :undef => :replace) if self[col].kind_of?(String) # Double-checking just in case.
  end
end
private :fix_utf8_encoding
And for bonus points you can also check if the field was changed using rails handy changed? helpers before fixing it.
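One caveat: smart quotes and ellipses are valid UTF-8, so encode with :invalid/:undef replacement will leave them untouched. If the goal is specifically to normalize them to ASCII punctuation, a small translation table works; a sketch (the helper name is made up):

```ruby
# Map Word's smart punctuation to plain ASCII equivalents.
SMART_PUNCTUATION = {
  "\u201C" => '"', "\u201D" => '"',  # curly double quotes
  "\u2018" => "'", "\u2019" => "'",  # curly single quotes / apostrophe
  "\u2026" => "...",                 # horizontal ellipsis
}.freeze

def plain_punctuation(text)
  text.gsub(/[\u201C\u201D\u2018\u2019\u2026]/, SMART_PUNCTUATION)
end

plain_punctuation("\u201CIt\u2019s done\u2026\u201D")  # => "\"It's done...\""
```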

Moving encoded data with Rails

I'm trying to move data from one database to another from within a rake task.
However, I'm getting some fruity encoding issues on some of the data:
rake aborted!
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0x92
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
What can I do to resolve this error and get the data in? As far as I can tell (not knowing anything about encoding), the source DB is latin1.
If both databases are Postgres, you can export and import the whole database using pg_dump's options to change the encoding; that is probably the most performant way to do it.
If you do this via a rake task, you have to do the transcoding inside the task, which means touching every attribute and re-encoding it.
It seems your new database is UTF-8 whereas the old one is latin1.
You could do it by re-encoding every string/text-like value. Checking respond_to?(:encoding) makes sure a value is re-encoded only if it carries encoding information, i.e. numeric values won't be transcoded:
def transcode(data, from_enc = 'ISO-8859-1', to_enc = 'UTF-8')
  # force_encoding relabels the latin1 bytes; encode then rewrites them as UTF-8
  if data.respond_to?(:encoding) && data.encoding.name != to_enc
    return data.dup.force_encoding(from_enc).encode(to_enc)
  end
  data
end
Now you can just read a record from the old DB, run it through this method and then write it to the new database:
u = OldDBUser.first
u.attribute_names.each do |x|
  u[x.to_sym] = transcode(u[x.to_sym])
end
# ... whatever you do with the transcoded u
... well, I have not tested these, but please do; maybe it's all you need.
