Force strings to UTF-8 from any encoding - ruby-on-rails

In my rails app I'm working with RSS feeds from all around the world, and some feeds have links that are not in UTF-8. The original feed links are out of my control, and in order to use them in other parts of the app, they need to be in UTF-8.
How can I detect encoding and convert to UTF-8?

Ruby 1.9
"Forcing" an encoding is easy, however it won't convert the characters just change the encoding:
str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
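A quick illustration (my own example, not from the original answer) of that point: the bytes stay exactly the same, only the label changes.
s = "café".encode('ISO-8859-1')   # é stored as the single byte 0xE9
s.bytes                           # => [99, 97, 102, 233]
s.force_encoding('UTF-8')         # retag the same bytes as UTF-8
s.bytes                           # => [99, 97, 102, 233] (identical bytes)
s.valid_encoding?                 # => false, because 0xE9 on its own is not valid UTF-8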
If you want to perform a conversion, use encode:
begin
str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
# ...
end
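For instance (my own example, continuing the snippet above), encode really transcodes the bytes when the source encoding is known, and the rescue matters because a character with no mapping in the target encoding raises:
latin = "café".encode('ISO-8859-1')   # é is the single byte 0xE9
latin.encode('UTF-8').bytes           # => [99, 97, 102, 195, 169], é is now two bytes
"café".encode('US-ASCII')             # raises Encoding::UndefinedConversionError, é has no ASCII mapping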
I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string

This will ensure you end up with a valid UTF-8 string no matter what, and it won't error out, because it replaces any invalid or undefined character with an empty string:
str.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})

Iconv
require 'iconv'
i = Iconv.new('UTF-8','LATIN1')
a_with_hat = i.iconv("\xc2")
Summary: the iconv gem does all the work of converting encodings. Make sure it's installed with:
gem install iconv
Now, you need to know what encoding your string is currently in, as Ruby 1.8 treats strings as an array of bytes with no intrinsic encoding. For example, say your string was in Latin-1 and you wanted to convert it to UTF-8:
require 'iconv'
string_in_utf8_encoding = Iconv.conv("UTF8", "LATIN1", string_in_latin1_encoding)
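If the input claims to be UTF-8 but contains stray invalid bytes, the 1.8-era idiom (also used in an answer further down) is to append //IGNORE to the target encoding so Iconv drops what it cannot convert instead of raising Iconv::IllegalSequence. A rough sketch, where dirty_string is a placeholder name:
require 'iconv'
clean_utf8 = Iconv.conv('UTF-8//IGNORE', 'UTF-8', dirty_string)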

Only this solution worked for me:
string.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Note the binary argument.
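Why the binary source argument matters (my own illustration): transcoding a string to the encoding it is already tagged with is a no-op, so invalid bytes survive; going via ASCII-8BIT forces every byte to actually be converted.
bad = "caf\xE9".force_encoding('UTF-8')   # 0xE9 on its own is not valid UTF-8
bad.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
# => "caf\xE9", unchanged: UTF-8 to UTF-8 is a no-op
bad.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
# => "caf", the invalid byte is examined and stripped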

Related

Rails/Ruby invalid byte sequence in UTF-8 even after force_encoding

I'm trying to iterate over a remote nginx log file (compressed .gz file) in Rails and I'm getting this error at some point in the file:
HTTPArgumentError: invalid byte sequence in UTF-8
I tried forcing the encoding too, although it seems the encoding was already UTF-8:
logfile = logfile.force_encoding("UTF-8")
The method that I'm using:
def remote_update
  uri = "http://" + self.url + "/localhost.access.log.2.gz"
  source = open(uri)
  gz = Zlib::GzipReader.new(source)
  logfile = gz.read
  # prints UTF-8
  print logfile.encoding.name
  logfile = logfile.force_encoding("UTF-8")
  # prints UTF-8
  print logfile.encoding.name
  logfile.each_line do |line|
    print line[/\/someregex\/1\/(.*)\//, 1]
  end
end
Really trying to understand why this is happening (tried to look in other SO threads with no success). What's wrong here?
Update:
Added exception's trace:
HTTPArgumentError: invalid byte sequence in UTF-8
from /Users/T/workspace/sample_app/app/models/server.rb:25:in `[]'
from /Users/T/workspace/sample_app/app/models/server.rb:25:in `block in remote_update'
from /Users/T/workspace/sample_app/app/models/server.rb:24:in `each_line'
from /Users/T/workspace/sample_app/app/models/server.rb:24:in `remote_update'
from (irb):2
from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:110:in `start'
from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:9:in `start'
force_encoding doesn't change the actual string data: it just changes the flag that says what encoding to use when interpreting the bytes.
If the data is not in fact UTF-8, or contains invalid UTF-8 sequences, then force_encoding won't help. force_encoding is basically only useful when you get some raw data from somewhere, you know what encoding it is in, and you want to tell Ruby what that encoding is.
The first thing to do is determine the actual encoding used. The charlock_holmes gem can detect encodings. A trickier case would be if the file were a mish-mash of encodings, but hopefully that isn't so (if it were, then perhaps handling each line separately might work).
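A rough sketch of that detection step, assuming the charlock_holmes gem is installed and logfile holds the raw bytes from the question (the exact fields the detector returns can vary by version):
require 'charlock_holmes'
detection = CharlockHolmes::EncodingDetector.detect(logfile)
# e.g. { :type => :text, :encoding => "ISO-8859-1", :confidence => 70 }
logfile = logfile.force_encoding(detection[:encoding]).encode('UTF-8', invalid: :replace, undef: :replace)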
If you want to take a string, which has the correct encoding, and transcode it to valid UTF-8 and clean up invalid characters you can use something like:
str.encode!('UTF-8', invalid: :replace, undef: :replace, replace: '?')
If you have a UTF-8 encoded string which has invalid UTF-8 characters in it, you can clean that up by using the 'binary' encoding source:
str.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '?')
Both will give you a UTF-8 string with any invalid characters replaced by question marks, which should then parse without errors. You can also pass replace: '' to strip the bad characters entirely, or leave the option off and you'll get the \uFFFD Unicode replacement character.
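To make the three behaviours concrete, here is what I would expect each variant to return for a byte that cannot be converted (my own illustration):
"abc\xFFdef".encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '?')  # => "abc?def"
"abc\xFFdef".encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')   # => "abcdef"
"abc\xFFdef".encode('UTF-8', 'binary', invalid: :replace, undef: :replace)                # => "abc\uFFFDdef"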
My guess is that the source file before gzipping had some binary data/corruption/invalid UTF-8 that got logged into it?
This question has also been asked and answered before on StackOverflow. See the following blog post for good information:
https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
And here's a prior example of a SO answer:
https://stackoverflow.com/a/18454435/506908

Incompatible Character Encoding in rails - how to just fail/skip sensibly?

I'm having an issue when importing Email subjects via IMAP.
I'm getting a problem that I think is related to the £ sign in email subjects.
Having spent a couple of hours touring around various answers I can't seem to find anything that works...
If I try the following...
Using ruby 2.1.2
views/emails/index
=email.subject
incompatible character encodings: ASCII-8BIT and UTF-8
=email.subject.scrub
incompatible character encodings: ASCII-8BIT and UTF-8
= email.subject.encode!('UTF-8', 'UTF-8', :invalid => :replace)
invalid byte sequence in UTF-8
= email.subject.force_encoding('UTF-8')
invalid byte sequence in UTF-8
= email.subject.encode("UTF-8", invalid: :replace)
"\xA3" from ASCII-8BIT to UTF-8
\xA3 is the '£' sign, which shouldn't be that unusual.
I'm currently working with the following...
- if email.subject.force_encoding('UTF-8').valid_encoding?
  = email.subject
- else
  "Can't display"
What I would ideally have is something that checks whether the encoding is working and then does something like #scrub is supposed to do... I'd even take it with '\xA3' in there perfectly happily, so long as it wasn't throwing an error and I could basically see the text.
Any ideas on either how to do it properly or a fudge to solve the issue?
After much pain this is how I solved it.
You need to add default encoding to your environment.rb file, like so:
# Load the rails application
require File.expand_path('../application', __FILE__)
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
# Initialize the rails application
Stma::Application.initialize!
Apparently this has something to do with Ruby's roots in Japan. When dealing with Japanese (or Russian) characters this default wouldn't be helpful, so it isn't set as standard.
I've then done the following:
mail_object = Mail.new(mail[0].attr["RFC822"])
subject = mail_object.subject.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') if mail_object.subject
body_part = (mail_object.text_part || mail_object.html_part || mail_object).body.decoded
body = body_part.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') if body_part
from = mail_object.from.join(",") if mail_object.from #deals with multiple addresses
to = mail_object.to.join(",") if mail_object.to #deals with multiple addresses
That should get all the main pieces into strings/text you can easily work with, and it won't fail nastily if something's missing or unusual, etc. Hope that helps somebody...

Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?

I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files are claimed to be UTF-8 encoded, but they clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:
tr -cd '^[:print:]' < original.xml > clean.xml
Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.
The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:
data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
doc.xpath(".//job").each do |node|
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /, "_").strip.downcase.to_sym] = e.content
    data.push(newrow)
  end
end
Running this on the raw file produces an error:
"Invalid byte sequence in UTF-8"
Here are all the helpful suggestions I've tried; all of them have failed.
Use Coder
Coder.clean!(data_string, "UTF-8")
Force Encoding
data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
Convert to UTF-16 and back to UTF-8
data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
data_string.encode!('UTF-8', 'UTF-16')
Use valid_encoding?
data_string.chars.select{|i| i.valid_encoding?}.join
No characters are removed; generates "invalid byte sequence" errors.
Specify encoding on opening the file
I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):
@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")
  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end
Use Regexp to remove non-printables:
data_string.gsub!(/[^[:print:]]/,"")
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)
For all of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.
So why does the Linux "tr" command work perfectly, and yet NONE of these suggestions can do the job in Ruby on Rails?
What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):
data_string.gsub!(/[Crazy Control Characters]/,"")
But the purist in me insists there should be a more elegant, general solution.
Ruby 2.1 has a new method called String#scrub, which is exactly what you need.
If the string contains invalid byte sequences, the invalid bytes are replaced with the given replacement character; otherwise self is returned. If a block is given, invalid bytes are replaced with the return value of the block.
Check the documentation for more information.
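A few variants, roughly as shown in the Ruby documentation:
"abc\u3042\x81".scrub                                                 # => "abc\u3042\uFFFD"
"abc\u3042\x81".scrub("*")                                            # => "abc\u3042*"
"abc\u3042\x81".scrub { |bytes| '<' + bytes.unpack('H*')[0] + '>' }   # => "abc\u3042<81>"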
I found this on Stack Overflow for some other question and this too worked fine for me. Assuming data_string is your XML:
data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Thanks for the responses. I did find something that works by testing all sorts of combinations of different tools. I hope this is helpful to other people who have shared the same frustration.
data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "" )
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
As you can see, it's a combination of the "encode" method and a regexp to remove control characters (except for newlines).
My testing revealed that the file I was importing had TWO problems: (1) invalid UTF-8 byte sequences; and (2) unprintable control characters that forced Nokogiri to stop parsing before the end of the file. I had to fix both problems, in that order, otherwise gsub! throws the "invalid byte sequence" error.
Note that the first line in the code above could be substituted with EITHER of the following with the same successful result:
Coder.clean!(data_string,'UTF-8')
or
data_string.scrub!("")
This worked perfectly for me.
Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"):
data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
This helped me once.

Ruby on Rails: UTF-8 encoding string that has %F1 in content

I'm struggling to find the right method in Rails that can convert UTF-8 codes to their displayable values.
In my case, it's converting some user input like "John%20Da%F1e" to "John Dañe" if possible.
Currently, i have the following:
unescaped_name = CGI::unescape(params[:name]) # this turns "John%20Da%F1e" into "John Da\xF1e"
@q = I18n.transliterate(unescaped_q) # this yields an 'invalid byte sequence in UTF-8' error
In essence, i'm trying to go from "John%20Da%F1e" (already encoded in UTF-8) to "John Dañe".
One thing i've tried was
.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
but that simply strips the escaped byte (\xF1), giving "John Dae".
You need to tell Ruby what the encoding of the parsed string should be. It looks like you are working in Latin-1 ('ISO-8859-1') to start with. There are a few different options. If you want to limit this decision to just the string you are processing, you can use .force_encoding like this:
require 'cgi'
unescaped_name = CGI::unescape( "John%20Da%F1e" ).force_encoding('ISO-8859-1')
# => "John Da\xF1e"
unescaped_name.encode('UTF-8')
# => "John Dañe"
Note that once the encoding is set up correctly, it already contains the correct characters, but you won't necessarily see that until you convert it to an encoding that you can display. So where I show "John Da\xF1e" that's only because my terminal is set to display UTF-8 - \xF1 is the byte for ñ in Latin-1 encoding.
As far as I can tell, the URI-encoded form of the same string using UTF-8 bytes in a single step looks like this:
"John%20Da%C3%B1e"
CGI::unescape( "John%20Da%C3%B1e" )
# => "John Dañe"

Ruby fixing multiple encoding documents

I'm trying to retrieve a Web page, and apply a simple regular expression on it.
Some Web pages contain non-UTF-8 characters, even though UTF-8 is claimed in Content-Type (example). In these cases I get:
ArgumentError (invalid byte sequence in UTF-8)
I've tried to use the following methods for sanitizing bad characters, but none of them helped to solve the issue:
content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)
content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")
Here's the complete code:
response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # detects encoding using Content-Type or the meta charset HTML tag
if (@encoding)
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end
@content.gsub!(/.../, "") # bang
Is there a way to deal with this issue? Basically, what I need is to set the base URL meta tag and inject some JavaScript into the retrieved Web page.
Thanks!
I had a similar problem importing emails with different encodings, I ended with this:
def enforce_utf8(from = nil)
  begin
    self.is_utf8? ? self : Iconv.iconv('utf8', from, self).first
  rescue
    converter = Iconv.new('UTF-8//IGNORE//TRANSLIT', 'ASCII//IGNORE//TRANSLIT')
    converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
  end
end
At first it tries to convert from *some_format* to UTF-8; if there isn't any encoding, or Iconv fails for some reason, it applies a stronger conversion (ignore errors, transliterate characters, and strip unrecognised characters).
let me know if it works for you ;)
A.
Use the ASCII-8BIT encoding instead.
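That answer is terse; one reading of it (my own sketch, in the context of the question's code and with a hypothetical base-tag substitution) is to keep the body tagged as raw bytes so gsub! never tries to validate it as UTF-8, with any pattern and replacement kept ASCII-only:
@content = response.body.force_encoding('ASCII-8BIT')   # a.k.a. BINARY
@content.gsub!(/<head>/, '<head><base href="http://example.com/">')   # hypothetical injection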
