"\xC2" to UTF-8 in conversion from ASCII-8BIT to UTF-8 - ruby-on-rails

I have a rails project that runs fine with MRI 1.9.3. When I try to run with Rubinius I get this error in app/views/layouts/application.html.haml:
"\xC2" to UTF-8 in conversion from ASCII-8BIT to UTF-8

It turns out the page had an invalid character (an interpunct '·'), which I found out with the following code (credits to this gist and this question):
lines = IO.readlines("app/views/layouts/application.html.haml").map do |line|
  line.force_encoding('ASCII-8BIT').encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
end
File.open("app/views/layouts/application.html.haml", "w") do |file|
  file.puts(lines)
end
After running this code, I could find the problematic characters with a simple git diff, and I moved the code into a helper file with # encoding: utf-8 at the top.
I'm not sure why this doesn't fail with MRI, but it probably should, since I'm not specifying the encoding of the Haml file.
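For illustration only, a minimal sketch of what such a helper might look like (the module and method names are hypothetical, not from the original project); the magic comment tells Ruby to treat the file's string literals as UTF-8:

# encoding: utf-8
module LayoutHelper
  # Hypothetical helper returning the literal that contained the interpunct,
  # so the character lives in a file with an explicit encoding.
  def title_separator
    ' · '
  end
end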

Related

Encoding::UndefinedConversionError "\xE7" from ASCII-8BIT to UTF-8

I'm working on a Rails API project.
Here is my code snippet:
class PeopleController < ApplicationController
  respond_to :json

  def index
    respond_with Person.all
  end
end
and when I visit the URL localhost:3000/people.json I get:
Encoding::UndefinedConversionError at /people.json
"\xE7" from ASCII-8BIT to UTF-8
I've been trying to solve this issue since last week, but I'm still fighting with it.
I've found a bunch of similar questions on Stack Overflow, such as this and this, but none of the solutions worked for me.
Here is the configuration I have:
Rails 4.2.7.1
ruby-2.3.1
Operating system: macOS Sierra
Output of locale
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Content of ~/.bash_profile:
export LC_CTYPE="utf-8"
export LC_CTYPE=en_US.UTF-8
export LANG=en_US.UTF-8
unset LC_ALL
Output of Encoding.default_external
#<Encoding:UTF-8>
I have had this problem many times, so I usually try to get rid of any characters that are invalid in UTF-8 before saving to the database. If you have your record saved as a String, you can replace invalid characters like so:
string = "This contains an invalid character \xE7"
string.encode('UTF-8', invalid: :replace, undef: :replace)
#=> "This contains an invalid character �"
This is of course prior to converting it to a JSON object.
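As an alternative sketch (not from the original answer): on Ruby 2.1+ (the asker is on 2.3.1), String#scrub replaces invalid bytes directly, and transcoding explicitly from binary sidesteps the case where encode can be a no-op in some Ruby versions because the source and destination encodings already match:

# Assumed sample data, not taken from the original question
raw = "This contains an invalid byte \xE7".force_encoding('ASCII-8BIT')

# Transcode from binary to UTF-8, replacing anything that has no UTF-8 mapping
raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
#=> "This contains an invalid byte ?"

# Or, if the string is already tagged UTF-8 but holds invalid bytes (Ruby >= 2.1)
"bad \xE7 byte".force_encoding('UTF-8').scrub('?')
#=> "bad ? byte"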

Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8): error in ROR + MongoDB based app

A developer wrote this method and it's causing an Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8) error.
This error only happens intermittently, so the data in the original DB field is what is causing the issue. But since I don't have any control over that, what can I put in the method below so that bad data doesn't cause any issues?
def scrub_string(input, line_break = ' ')
  begin
    input.an_address.delete("^\u{0000}-\u{007F}").gsub("\n", line_break)
  rescue
    input || ''
  end
end
Will this work?
input = input.encode('utf-8', :invalid => :replace, :undef => :replace, :replace => '_')
Yeah, this should work; it'll replace any characters that can't be converted into UTF-8 with an underscore.
Read more about encoding strings in ruby here:
http://ruby-doc.org/core-1.9.3/String.html#method-i-encode
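For completeness, here is a sketch of how that line might be folded into the method above; like the original, it assumes input responds to an_address and that it returns a String:

def scrub_string(input, line_break = ' ')
  # Transcode the raw (possibly ASCII-8BIT) value to UTF-8 first,
  # replacing anything that cannot be converted with an underscore.
  address = input.an_address.encode('utf-8', :invalid => :replace, :undef => :replace, :replace => '_')
  # Then strip non-ASCII characters and normalize line breaks, as before.
  address.delete("^\u{0000}-\u{007F}").gsub("\n", line_break)
rescue
  input || ''
end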

Prawn encoding not correct

I have problems running this code with Prawn:
require 'prawn'

Prawn::Document.generate "example.pdf" do |pdf|
  pdf.text_box "W\xF6rth".force_encoding('UTF-8'), :at => [200, 720], :size => 32
end
Somehow I get this error:
`rescue in normalize_encoding': Arguments to text methods must be UTF-8 encoded (Prawn::Errors::IncompatibleStringEncoding)
But when I try this code, it works:
pdf.text_box "Wörth".force_encoding('UTF-8')
What am I doing wrong? How can I also fix my first example with the \xF6 in the string? Thanks!
"W\xF6rth" is not a valid UTF-8 sequence.
"W\xF6rth".valid_encoding?
=> false
The maximum one-byte character code in UTF-8 is 0x7F, after that you need to start encoding with two bytes.
"Wörth".bytes.map { |b| b.to_s(16) }
=> ["57", "c3", "b6", "72", "74", "68"]
^^----^^ <-- Two bytes representing UTF-8 "ö"
I think you're trying to convert ISO-8859-1 to UTF-8.
In ISO-8859-1 "ö" is 0xF6.
This is what should work in your case:
"W\xf6rth".force_encoding('iso-8859-1').encode('utf-8')
=> "Wörth"
I.e.
pdf.text_box "W\xF6rth".force_encoding('iso-8859-1').encode('utf-8'), ...
References:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/UTF-8

Ruby string force encoding

How can I force encode this: Al-F\u0026#257;ti\u0026#293;ah to Al-Fātiĥah
I tried .encode!('UTF-16', :undef => :replace, :invalid => :replace, :replace => "") and force_encoding("UTF-8") with no luck
That text seems to include HTML or XML entities.
Try
require "cgi/util"
CGI.unescapeHTML("Al-F\u0026#257;ti\u0026#293;ah")
or
# gem install nokogiri
require "nokogiri"
Nokogiri::XML.fragment("Al-F\u0026#257;ti\u0026#293;ah").text
See: Converting escaped XML entities back into UTF-8
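For reference, both snippets decode the numeric entities &#257; (ā, U+0101) and &#293; (ĥ, U+0125); a quick check, with the output shown as an expectation for the same input string:

require "cgi/util"

# "\u0026" is "&", so the literal below is really "Al-F&#257;ti&#293;ah"
CGI.unescapeHTML("Al-F\u0026#257;ti\u0026#293;ah")
#=> "Al-Fātiĥah"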

How to change the encoding during CSV parsing in Rails

I would like to know how I can change the encoding of my CSV file when I import and parse it. I have this code:
csv = CSV.parse(output, :headers => true, :col_sep => ";")
csv.each do |row|
  row = row.to_hash.with_indifferent_access
  insert_data_method(row)
end
When I read my file, I get this error:
Encoding::CompatibilityError in FileImportingController#load_file
incompatible character encodings: ASCII-8BIT and UTF-8
I read about row.force_encoding('utf-8') but it does not work:
NoMethodError in FileImportingController#load_file
undefined method `force_encoding' for #<ActiveSupport::HashWithIndifferentAccess:0x2905ad0>
Thanks.
I had to read CSV files encoded in ISO-8859-1.
Doing the documented
CSV.foreach(filename, encoding:'iso-8859-1:utf-8', col_sep: ';', headers: true) do |row|
threw the exception
ArgumentError: invalid byte sequence in UTF-8
  from csv.rb:2027:in '=~'
  from csv.rb:2027:in 'init_separators'
  from csv.rb:1570:in 'initialize'
  from csv.rb:1335:in 'new'
  from csv.rb:1335:in 'open'
  from csv.rb:1201:in 'foreach'
so I ended up reading the file and converting it to UTF-8 while reading, then parsing the string:
CSV.parse(File.open(filename, 'r:iso-8859-1:utf-8') { |f| f.read }, col_sep: ';', headers: true, header_converters: :symbol) do |row|
  pp row
end
force_encoding is meant to be run on a string, but it looks like you're calling it on a hash. You could say:
output.force_encoding('utf-8')
csv = CSV.parse(output, :headers => true, :col_sep => ";")
...
Hey, I wrote a little blog post about what I did, but it's slightly more verbose than what's already been posted. For whatever reason, I couldn't get those solutions to work, but this did.
The gist is that I simply replace (or, in my case, remove) the invalid/undefined characters in my file and then rewrite it. I used this method to convert the files:
require 'tempfile'

def convert_to_utf8_encoding(original_file)
  original_string = original_file.read
  # If you'd rather invalid characters be replaced with something else, change :replace here.
  final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '')
  final_file = Tempfile.new('import') # No need to save a real File
  final_file.write(final_string)
  final_file.close # Don't forget me
  final_file
end
Hope this helps.
Edit: No destination encoding is specified here because encode assumes that you're encoding to your default encoding, which for most Rails applications is UTF-8 (I believe).
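A hypothetical usage sketch combining this helper with the parsing code from the question (the file name is an assumption; insert_data_method and with_indifferent_access are carried over from the question and assume a Rails environment):

require 'csv'

# Convert the upload to UTF-8 first, then parse the cleaned temp file.
clean_file = convert_to_utf8_encoding(File.open('import.csv', 'rb'))
CSV.foreach(clean_file.path, headers: true, col_sep: ';') do |row|
  insert_data_method(row.to_hash.with_indifferent_access)
end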
