Gsub raises "invalid byte sequence in UTF-8" - ruby-on-rails

I have the following method call:
Formatting.git_log_to_html(`git log --no-merges master --pretty=full #{interval}`)
The value of interval is something like release-20130325-01..release-20130327-04.
The git_log_to_html Ruby method is as follows (I am only pasting the line that raises the error):
module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
      ...
    end
  end
end
This used to work, but now gsub raises an "invalid byte sequence in UTF-8" error.
Could you help me understand why, and how I can fix it? :/
Here is the output of git_log:
https://dl.dropbox.com/u/42306424/output.txt

For some reason, this command:
git log --no-merges master --pretty=full #{interval}
is giving you a result that is not encoded in UTF-8. It may be that your machine is working with a different charset; try the following:
module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log.force_encoding("UTF-8").gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
      ...
    end
  end
end
I'm not sure if that will work, but you can try.
If that doesn't work, you can detect the charset and use Ruby's Iconv to encode the string to UTF-8: http://www.ruby-doc.org/stdlib-2.0/libdoc/iconv/rdoc/
Based on the file you added in the comment, I did:
require 'open-uri'
content = open('https://dl.dropbox.com/u/42306424/output.txt').read
content.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit")
and it worked without any trouble.
btw, you can try:
require 'iconv'

module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log
    git_log.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
      ...
    end
  end
end
But you should really detect the charset of the string before attempting a conversion to UTF-8.
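For example, here is a minimal detection sketch using only the standard library (the candidate list is an assumption; extend it for the charsets you expect, and note that single-byte encodings such as ISO-8859-1 accept any byte sequence, so they must come last as the catch-all):
# Try candidate encodings in order and return the first one under which
# the bytes form a valid string. ISO-8859-1 validates anything, so it
# acts as the fallback.
CANDIDATES = %w[UTF-8 Shift_JIS ISO-8859-1]

def guess_charset(bytes)
  CANDIDATES.find { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

# Inside git_log_to_html:
charset = guess_charset(git_log)
git_log = git_log.force_encoding(charset).encode('UTF-8') if charset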

Related

How to write a Tempfile as binary

I am trying to write a string / unzipped file to a Tempfile by doing:
temp_file = Tempfile.new([name, extension])
temp_file.write(unzipped_io.read)
This throws the following error when the content is an image:
Encoding::UndefinedConversionError - "\xFF" from ASCII-8BIT to UTF-8
While researching this, I found out that it happens because Ruby writes files with a default encoding (UTF-8). The file should instead be written as binary, so that no encoding conversion is attempted.
With a regular File you would do this as follows:
File.open('/tmp/test.jpg', 'wb') do |file|
  file.write(unzipped_io.read)
end
How do I do this with Tempfile?
Tempfile.new passes options to File.open which accepts the options from IO.new, in particular:
:binmode
If the value is truth value, same as “b” in argument mode.
So to open a tempfile in binary mode, you'd use:
temp_file = Tempfile.new([name, extension], binmode: true)
temp_file.binmode? #=> true
temp_file.external_encoding #=> #<Encoding:ASCII-8BIT>
In addition, you might want to use Tempfile.create which takes a block and automatically closes and removes the file afterwards:
Tempfile.create([name, extension], binmode: true) do |temp_file|
  temp_file.write(unzipped_io.read)
  # ...
end
I have encountered the solution in an old Ruby forum post, so I thought I would share it here, making it easier for people to find:
https://www.ruby-forum.com/t/ruby-binary-temp-file/116791
Apparently Tempfile has an undocumented method binmode, which switches writing to binary mode, thus avoiding any encoding issues:
temp_file = Tempfile.new([name, extension])
temp_file.binmode
temp_file.write(unzipped_io.read)
Thanks to the unknown person who mentioned it on ruby-forum.com in 2007!
Another alternative is IO.binwrite(path, file_content).
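For example, writing the same image bytes (IO.binwrite opens the file in binary mode for you):
IO.binwrite('/tmp/test.jpg', unzipped_io.read)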

Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?

I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files are claimed to be UTF-8 encoded, but they clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:
tr -cd '^[:print:]' < original.xml > clean.xml
Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.
The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:
require 'nokogiri'

data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
doc.xpath(".//job").each do |node|
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /, "_").strip.downcase.to_sym] = e.content
    data.push(newrow)
  end
end
Running this on the raw file produces an error:
"Invalid byte sequence in UTF-8"
Here are all the suggestions I've tried, all of which have failed.
Use Coder
Coder.clean!(data_string, "UTF-8")
Force Encoding
data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
Convert to UTF-16 and back to UTF-8
data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
data_string.encode!('UTF-8', 'UTF-16')
Use valid_encoding?
data_string.chars.select{|i| i.valid_encoding?}.join
No characters are removed; it still generates "invalid byte sequence" errors.
Specify encoding on opening the file
I actually wrote a function that tries every possible encoding until it can open the file without errors and convert it to UTF-8 (@file_encodings is an array of every possible file encoding):
@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")
  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end
Use Regexp to remove non-printables:
data_string.gsub!(/[^[:print:]]/,"")
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)
For all of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.
So why does the Linux "tr" command work perfectly, while NONE of these suggestions can do the job in Ruby on Rails?
What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):
data_string.gsub!(/[Crazy Control Characters]/,"")
But the purist in me insists there should be a more elegant, general solution.
Ruby 2.1 has a new method called String#scrub, which is exactly what you need.
If the string is invalid byte sequence then replace invalid bytes with
given replacement character, else returns self. If block is given,
replace invalid bytes with returned value of the block.
Check the documentation for more information.
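For example, a quick sketch against a string containing a stray invalid byte:
bad = "abc\xFFdef"
bad.valid_encoding?  #=> false
bad.scrub            #=> "abc\uFFFDdef" (the default replacement character)
bad.scrub("")        #=> "abcdef"
bad.scrub { |bytes| "<#{bytes.unpack('H*')[0]}>" }  #=> "abc<ff>def"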
I found this on Stack Overflow under another question, and it too worked fine for me. Assuming data_string is your XML:
data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Thanks for the responses. I did find something that works by testing all sorts of combinations of different tools. I hope this is helpful to other people who have shared the same frustration.
data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "" )
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
As you can see, it's a combination of the "encode" method and a regexp to remove control characters (except for newlines).
My testing revealed that the file I was importing had TWO problems: (1) invalid UTF-8 byte sequences; and (2) unprintable control characters that forced Nokogiri to stop parsing before the end of the file. I had to fix both problems, in that order, otherwise gsub! throws the "invalid byte sequence" error.
Note that the first line in the code above could be substituted with EITHER of the following with the same successful result:
Coder.clean!(data_string,'UTF-8')
or
data_string.scrub!("")
This worked perfectly for me.
Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"):
data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
This helped me once.
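This works because every possible byte sequence is valid ISO-8859-1, so the transcode to UTF-8 can never hit an invalid byte. A quick illustration:
bad = "caf\xE9"                                   # Latin-1 bytes mislabeled as UTF-8
bad.valid_encoding?                               #=> false
bad.force_encoding("ISO-8859-1").encode("utf-8")  #=> "café"
The caveat: if the bytes were not really Latin-1, you get mojibake rather than an error.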

Ruby fixing multiple encoding documents

I'm trying to retrieve a Web page, and apply a simple regular expression on it.
Some Web pages contain non-UTF-8 characters, even though UTF-8 is claimed in Content-Type (example). In these cases I get:
ArgumentError (invalid byte sequence in UTF-8)
I've tried the following methods to sanitize the bad characters, but neither of them solved the issue:
content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)
content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")
Here's the complete code:
response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # Detects encoding using Content-Type or meta charset HTML tag
if @encoding
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end
@content.gsub!(/.../, "") # bang
Is there a way to deal with this issue? Basically, what I need is to set the base URL meta tag and inject some JavaScript into the retrieved Web page.
Thanks!
I had a similar problem importing emails with different encodings, I ended with this:
def enforce_utf8(from = nil)
  begin
    self.is_utf8? ? self : Iconv.iconv('utf8', from, self).first
  rescue
    converter = Iconv.new('UTF-8//IGNORE//TRANSLIT', 'ASCII//IGNORE//TRANSLIT')
    converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
  end
end
First it tries to convert from *some_format* to UTF-8; if there isn't any encoding, or Iconv fails for some reason, it applies a strong conversion (ignore errors, transliterate characters, and strip unrecognized characters).
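For reference, a usage sketch (an assumption on my part: the method is monkey-patched into String, which its use of self suggests, and the iconv gem is available):
# Assumption: enforce_utf8 above is defined inside `class String`.
latin1 = "caf\xE9"                 # ISO-8859-1 bytes
latin1.enforce_utf8('ISO-8859-1')  #=> "café", via Iconv
latin1.enforce_utf8                # no source charset given, so it falls into the rescue branch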
let me know if it works for you ;)
A.
Use the ASCII-8BIT encoding instead.

ActionView::Template::Error (incompatible character encodings: UTF-8 and ASCII-8BIT)

I am using Ruby 1.9.2, Rails 3.0.4/3.0.5 and Phusion Passenger 3.0.3/3.0.4. My templates are written in HAML and I am using the MySQL2 gem. I have a controller action that when passed a parameter that has a special character, like an umlaut, gives me the following error:
ActionView::Template::Error (incompatible character encodings: UTF-8 and ASCII-8BIT)
The error points to the first line of my HAML template, which has the following code on it:
<!DOCTYPE html>
My understanding is that this is caused by a UTF-8 string being concatenated with an ASCII-8BIT string, but I can't for the life of me figure out what that ASCII-8BIT string is. I have checked that the params in the action are encoded in UTF-8, and I have added an encoding: UTF-8 declaration to the top of the HAML template and the Ruby files, but I still get this error. My application.rb file has a config.encoding = "UTF-8" declaration as well, and the following all result in UTF-8:
ENV['LANG']
__ENCODING__
Encoding.default_internal
Encoding.default_external
Here's the kicker: I cannot reproduce this locally on my Mac (OS X) using standalone Passenger or Mongrel, in either development or production. I can only reproduce it on a production server running nginx + Passenger on Linux. I have verified in the production server's console that the commands mentioned above all return UTF-8 as well.
Have you experienced this same error and how did you solve it?
After doing some debugging, I found that the issue occurs when using the ActionDispatch::Request object, whose strings are all encoded in ASCII-8BIT, regardless of whether my app is coded in UTF-8 or not. I do not know why this only happens on a production server on Linux, but I'm going to assume it's some quirk in Ruby or Rails, since I was unable to reproduce the error locally. The error occurred specifically because of a line like this:
@current_path = request.env['PATH_INFO']
When this instance variable was printed in the HAML template it caused an error because the string was encoded in ASCII-8BIT instead of UTF-8. To solve this I did the following:
@current_path = request.env['PATH_INFO'].dup.force_encoding(Encoding::UTF_8)
This forces @current_path to use a duplicated string coerced into the proper UTF-8 encoding. The error can also occur with other request-related data, such as request.headers.
MySQL could be the source of the troublesome ASCII. Try putting the following in an initializer to at least eliminate this possibility:
require 'mysql'

class Mysql::Result
  def encode(value, encoding = "utf-8")
    String === value ? value.force_encoding(encoding) : value
  end

  def each_utf8(&block)
    each_orig do |row|
      yield row.map { |col| encode(col) }
    end
  end
  alias each_orig each
  alias each each_utf8

  def each_hash_utf8(&block)
    each_hash_orig do |row|
      row.each { |k, v| row[k] = encode(v) }
      yield(row)
    end
  end
  alias each_hash_orig each_hash
  alias each_hash each_hash_utf8
end
Edit: this may not be applicable to the mysql2 gem; it works for mysql, however.

Rails 3 Encoding::CompatibilityError

I am working on a Rails app that submits a French translation via AJAX, and for some reason I keep getting the following error in the log:
Encoding::CompatibilityError incompatible character encodings: UTF-8 and ASCII-8BIT
Does anyone know how to fix this?
FIX: This works on the WEBrick server.
Place # encoding: UTF-8 at the top of each file in which you want to work with non-ASCII characters.
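For example (a sketch; the controller name is hypothetical, and the magic comment must be the very first line, or the second line after a shebang):
# encoding: UTF-8
class TranslationsController < ApplicationController
  GREETING = "à bientôt" # non-ASCII literals in this file now parse as UTF-8
end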
I can't get this to work on a rails server with Thin... anyone else run into this?
https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors
the above link fixed my problem.
Specifically myString.force_encoding('UTF-8') on the string before sending it for translation.
Placed the sample code in application_controller.rb and all is well.
I know this is old, but I had the same problem and the solution was in the link @dennismonsewicz gave. In detail, the code was:
before_filter :force_utf8_params

def force_utf8_params
  traverse = lambda do |object, block|
    if object.kind_of?(Hash)
      object.each_value { |o| traverse.call(o, block) }
    elsif object.kind_of?(Array)
      object.each { |o| traverse.call(o, block) }
    else
      block.call(object)
    end
    object
  end
  force_encoding = lambda do |o|
    o.force_encoding(Encoding::UTF_8) if o.respond_to?(:force_encoding)
  end
  traverse.call(params, force_encoding)
end
I fixed this issue by converting a UTF-8 file to ASCII.
See the answer here: ruby 1.9 + sinatra incompatible character encodings: ASCII-8BIT and UTF-8
