How to change html encoded character to ascii character - ruby-on-rails

I have a french character that is encoded as follows:
"Jos\xE9e"
I need to convert it to regular character because it produces this error on my server:
invalid byte sequence in UTF-8
What can I do to fix this error?
Rails 3 Ruby 1.9.2

That looks like "Josée" encoded in ISO 8859-1 (AKA Latin-1). You can use Iconv to convert it to UTF-8:
require 'iconv'
utf_string = Iconv.conv('UTF-8', 'ISO-8859-1', "Jos\xE9e")

Use an editor that supports UTF-8, and add a coding line at the top of every source file:
# coding: utf-8
If an input string is not UTF-8, convert it to UTF-8 before processing:
input_str = "Jos\xE9e"
utf_input = input_str.force_encoding('iso-8859-1').encode('utf-8')
All of the above requires Ruby 1.9 or later. For more information, see the book Ruby Best Practices.
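A minimal sketch of the conversion above, assuming the bytes really are ISO 8859-1 (the variable names are illustrative). `valid_encoding?` shows why the original string triggers the error and why the converted one does not:

```ruby
# Assumption: the input arrived as ISO-8859-1 bytes but is tagged as UTF-8.
input_str = "Jos\xE9e".dup.force_encoding('UTF-8')
was_valid = input_str.valid_encoding?   # the lone \xE9 byte is not valid UTF-8

# Relabel the bytes with their real encoding, then transcode to UTF-8.
utf_input = input_str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
now_valid = utf_input.valid_encoding?

puts "valid before: #{was_valid}, valid after: #{now_valid}"
puts utf_input
```

If the real source encoding is something other than ISO 8859-1, the result will be valid UTF-8 but may contain the wrong characters, so it is worth confirming the encoding first.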

You should use UTF-8 in all your source code. Try saving your file in UTF-8 encoding.

Related

Rails/Ruby invalid byte sequence in UTF-8 even after force_encoding

I'm trying to iterate over a remote nginx log file (compressed .gz file) in Rails and I'm getting this error at some point in the file:
ArgumentError: invalid byte sequence in UTF-8
I tried forcing the encoding too although it seems the encoding was already UTF8:
logfile = logfile.force_encoding("UTF-8")
The method that I'm using:
def remote_update
  uri = "http://" + self.url + "/localhost.access.log.2.gz"
  source = open(uri)
  gz = Zlib::GzipReader.new(source)
  logfile = gz.read
  # prints UTF-8
  print logfile.encoding.name
  logfile = logfile.force_encoding("UTF-8")
  # prints UTF-8
  print logfile.encoding.name
  logfile.each_line do |line|
    print line[/\/someregex\/1\/(.*)\//,1]
  end
end
Really trying to understand why this is happening (tried to look in other SO threads with no success). What's wrong here?
Update:
Added exception's trace:
HTTPArgumentError: invalid byte sequence in UTF-8
from /Users/T/workspace/sample_app/app/models/server.rb:25:in `[]'
from /Users/T/workspace/sample_app/app/models/server.rb:25:in `block in remote_update'
from /Users/T/workspace/sample_app/app/models/server.rb:24:in `each_line'
from /Users/T/workspace/sample_app/app/models/server.rb:24:in `remote_update'
from (irb):2
from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:110:in `start'
from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:9:in `start'
force_encoding doesn't change the actual string data: it just changes the variable that says what encoding to use when interpreting the bytes.
If the data is not in fact utf-8 or contains invalid utf-8 sequences then force encoding won't help. Force encoding is basically only useful when you get some raw data from somewhere and you know what encoding it is in and you want to tell ruby what that encoding is.
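To illustrate the difference, here is a small sketch (the sample bytes are an assumption, not taken from the log in the question). `force_encoding` leaves the bytes untouched, while `encode` rewrites them:

```ruby
raw = "Jos\xE9e".b                             # raw bytes tagged as binary (String#b needs Ruby 2.0+)

relabeled    = raw.dup.force_encoding('UTF-8') # only the encoding label changes
same_bytes   = (relabeled.bytes == raw.bytes)  # nothing was converted
still_broken = !relabeled.valid_encoding?      # \xE9 on its own is not valid UTF-8

# Declare the real encoding first (assumed to be ISO-8859-1 here), then transcode.
transcoded = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts "same bytes after force_encoding: #{same_bytes}"
puts "still invalid as UTF-8: #{still_broken}"
puts "byte length after encode: #{transcoded.bytes.length}"
```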
The first thing to do would be to determine what is the actual encoding used. The charlock_holmes gem can detect encodings. A more tricky case would be if the file was a mish-mash of encodings but hopefully that isn't the case (if it was, then perhaps trying to handle each line separately might work).
If you want to take a string, which has the correct encoding, and transcode it to valid UTF-8 and clean up invalid characters you can use something like:
str.encode!('UTF-8', invalid: :replace, undef: :replace, replace: '?')
If you have a UTF-8 encoded string which has invalid UTF-8 characters in it, you can clean that up by using the 'binary' encoding source:
str.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '?')
Both will give you a UTF-8 string with any invalid characters replaced by question marks which should pass. You can also pass in replace: '' to strip the bad characters, or leave the option off and you'll get the \uFFFD unicode character.
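The three `replace` variants can be compared side by side. This is a sketch on a made-up string with one stray byte; note that with a 'binary' source every non-ASCII byte counts as undefined, so this form mainly suits data that is otherwise plain ASCII:

```ruby
dirty = "caf\xE9 ok".dup.force_encoding('UTF-8')   # made-up sample with one invalid byte

# Bad byte becomes a question mark.
with_marks = dirty.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '?')
# Bad byte is dropped.
stripped = dirty.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
# No replace option: the Unicode replacement character U+FFFD is used.
with_u_fffd = dirty.encode('UTF-8', 'binary', invalid: :replace, undef: :replace)

[with_marks, stripped, with_u_fffd].each { |s| puts s.inspect }
```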
My guess is that the source file before gzipping had some binary data/corruption/invalid UTF-8 that got logged into it?
This question has also been asked and answered before on StackOverflow. See the following blog post for good information:
https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
And here's a prior example of a SO answer:
https://stackoverflow.com/a/18454435/506908

Force strings to UTF-8 from any encoding

In my rails app I'm working with RSS feeds from all around the world, and some feeds have links that are not in UTF-8. The original feed links are out of my control, and in order to use them in other parts of the app, they need to be in UTF-8.
How can I detect encoding and convert to UTF-8?
Ruby 1.9
"Forcing" an encoding is easy; however, it won't convert the characters, it only changes the encoding label:
str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
If you want to perform a conversion, use encode:
begin
  str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
  # ...
end
I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string
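Ruby cannot know for certain what encoding arbitrary bytes are in, so one pragmatic approach is to try a short list of likely encodings in order. The helper below is a hypothetical sketch: the name `to_utf8` and the candidate list are assumptions, and the list should be adapted to the feeds you actually receive.

```ruby
# Hypothetical helper: interpret the bytes as each candidate encoding in turn
# and return the first result that converts cleanly to UTF-8.
def to_utf8(str, candidates = ['UTF-8', 'Windows-1252', 'ISO-8859-1'])
  candidates.each do |name|
    attempt = str.dup.force_encoding(name)
    next unless attempt.valid_encoding?
    begin
      return attempt.encode('UTF-8')
    rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
      next
    end
  end
  # Last resort (Ruby 2.1+): keep what is valid and mark the rest.
  str.dup.force_encoding('UTF-8').scrub('?')
end

puts to_utf8("Jos\xE9e".b)   # bytes that are not valid UTF-8, read as Windows-1252
```

Trying UTF-8 first matters: valid UTF-8 is rarely produced by accident, whereas single-byte encodings accept almost any byte sequence.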
The following ensures that you end up with a valid UTF-8 string and that no error is raised, because it replaces any invalid or undefined character with an empty string:
str.encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: '')
Iconv
require 'iconv'
i = Iconv.new('UTF-8','LATIN1')
a_with_hat = i.iconv("\xc2")
Summary: the iconv gem does all the work of converting encodings. Make sure it's installed with:
gem install iconv
Now, you need to know what encoding your string is currently in, as Ruby 1.8 treats strings as arrays of bytes (with no intrinsic encoding). For example, say your string is in Latin-1 and you want to convert it to UTF-8:
require 'iconv'
string_in_utf8_encoding = Iconv.conv("UTF8", "LATIN1", string_in_latin1_encoding)
Only this solution worked for me:
string.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Note the binary argument.
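One caveat with the binary source, shown as a sketch on a made-up string: every byte above 0x7F is treated as undefined, so valid multibyte UTF-8 characters are removed along with the broken ones. On Ruby 2.1 or later, `String#scrub` removes only the invalid sequences:

```ruby
# Made-up sample: a valid "é" followed by one stray byte.
mixed = ("Jos\u00E9e ".b + "\xFF".b).force_encoding('UTF-8')

# Binary source: both the valid é (two bytes) and the stray \xFF are dropped.
via_binary = mixed.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

# scrub (Ruby 2.1+): only the invalid \xFF is dropped, é survives.
via_scrub = mixed.scrub('')

puts via_binary.inspect
puts via_scrub.inspect
```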

Ruby on Rails 3 => truncate method with special characters throws Encoding Incompatibility error

I need some help with the following. I got a string here which contains special characters e.g. ë, é etc. I can display them correctly in my view but once I call the truncate method, it throws the following error:
incompatible character encodings: ASCII-8BIT and UTF-8
The weird thing is that, when I inspect the encoding of the truncated string, it does give me UTF-8, which is what I need (and UTF-8 is used for my database).
my_string_with_special_characters.truncate(35).encoding.inspect
=> UTF-8
But this is what happens when I call:
<%= my_string_with_special_characters.truncate(35) %>
=> incompatible character encodings: ASCII-8BIT and UTF-8
I have also tried the magic_encoding gem, which prepends the magic comment "encoding : utf-8" to all of my controller files, but I still got the incompatible character encoding error.
If anyone knows how to solve this, let me know. Much appreciated.
Alex
Try adding this line at the beginning of your file (for *.rb files):
# -*- encoding: utf-8 -*-
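For context, here is a sketch of how this kind of error typically arises (the strings are made up). Ruby refuses to join a UTF-8 string with an ASCII-8BIT string when both contain non-ASCII bytes, which can happen when a template and a value disagree about their encoding. Relabeling the binary string, assuming its bytes really are UTF-8, lets the concatenation succeed:

```ruby
utf8   = "caf\u00E9 "     # UTF-8 string with a non-ASCII character
binary = "\xC3\xAB".b     # the UTF-8 bytes of "ë", but tagged ASCII-8BIT

error = begin
  utf8 + binary
  nil
rescue Encoding::CompatibilityError => e
  e
end
puts error.class

# Assumption: the bytes are known to be UTF-8, so relabeling them is safe.
joined = utf8 + binary.dup.force_encoding('UTF-8')
puts joined
```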

Character Encoding issue in Rails v3/Ruby 1.9.2

I get this error sometimes "invalid byte sequence in UTF-8" when I read contents from a file. Note - this only happens when there are some special characters in the string. I have tried opening the file without "r:UTF-8", but still get the same error.
open(file, "r:UTF-8").each_line { |line| puts line.strip(",") } # line.strip generates the error
Contents of the file:
# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works
This is the CSV file I got from outside and I am trying to import it into my DB, it did not come with "# encoding: UTF-8" at the top, but I added this since I read somewhere it will fix this problem, but it did not. :(
Environment:
Rails v3.0.3
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.5.0]
Ruby has a notion of an external encoding and internal encoding for each file. This allows you to work with a file in UTF-8 in your source, even when the file is stored in a more esoteric format. If your default external encoding is UTF-8 (which it is if you're on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding. When you open your file and pass "r:UTF-8", you are forcing the same external encoding that Ruby already uses by default.
Chances are, your source document isn't in UTF-8 and those non-ASCII characters aren't mapping cleanly to UTF-8 (if they were, you would get the correct characters and no error; if they mapped incorrectly, you would get incorrect characters and no error). What you should do is try to determine the encoding of the source document, then have Ruby transcode the document on read, like so:
File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.strip(",") }
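A self-contained sketch of transcoding on read. It writes a one-line sample file and reads it back; the file name is hypothetical and ISO-8859-1 is an assumed source encoding chosen for the demonstration, not a diagnosis of the file in the question:

```ruby
require 'tmpdir'

lines = []
Dir.mktmpdir do |dir|
  path = File.join(dir, 'places.csv')   # hypothetical file name
  # Assumption: the data is ISO-8859-1, where the byte \xE4 is "ä".
  File.binwrite(path, "290919,\"SE\",\"26\",\"Sk\xE4l\"\n".b)

  # "r:external:internal" makes Ruby transcode from ISO-8859-1 to UTF-8 while reading.
  File.open(path, 'r:ISO-8859-1:UTF-8') do |io|
    io.each_line { |line| lines << line.chomp }
  end
end

puts lines.first
puts lines.first.encoding
```

Because the strings are already valid UTF-8 by the time they reach your code, later string operations on each line no longer raise the invalid byte sequence error.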
If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).
If you want to change your file's encoding, you can use the charlock_holmes gem:
https://github.com/brianmario/charlock_holmes
require 'charlock_holmes/string'
content = File.read('test2.txt')
if !content.is_utf8?
  detection = CharlockHolmes::EncodingDetector.detect(content)
  utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
end
Then you can save the new content to a temp file and overwrite your original file.
Hope this helps.

Ruby On Rails and UTF-8

I have a Rails application with SayController, a hello action and the view template say/hello.html.erb. When I add a Cyrillic character like "ю", I get an error:
ArgumentError in SayController#hello
invalid byte sequence in UTF-8
Headers:
{"Cache-Control"=>"no-cache",
"X-Runtime"=>"11",
"Content-Type"=>"text/html; charset=utf-8"}
If I try to write this letter with embedded Ruby,
<%= "ю" %>
I don't get any error, but it displays a question mark in a black square (�) instead of this letter.
I use Windows 7 x64, Ruby 1.9.1p378, Rails 2.3.5, WEBrick server.
A likely cause of this error is that the file which contains the Cyrillic letters is not encoded in UTF-8, but perhaps in some Russian encoding like KOI8. This makes the characters impossible to interpret as UTF-8 (and rightly so!).
So double check that your file is properly encoded in UTF8.
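One way to check is to read the file's raw bytes and ask Ruby whether they form valid UTF-8. The helper name `utf8_file?` and the sample file names below are hypothetical, and KOI8-R is used only as an assumed example of a non-UTF-8 encoding:

```ruby
require 'tmpdir'

# Hypothetical helper: true when the file's raw bytes form valid UTF-8.
def utf8_file?(path)
  File.binread(path).force_encoding('UTF-8').valid_encoding?
end

results = {}
Dir.mktmpdir do |dir|
  good = File.join(dir, 'utf8.html.erb')
  bad  = File.join(dir, 'koi8.html.erb')
  File.binwrite(good, "\u044E")                    # the letter "ю" saved as UTF-8
  File.binwrite(bad,  "\u044E".encode('KOI8-R'))   # the same letter saved as KOI8-R
  results[:good] = utf8_file?(good)
  results[:bad]  = utf8_file?(bad)
end

p results
```

A file that fails this check needs to be re-saved as UTF-8 in your editor, or converted with `encode` once you know its real encoding.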
Create an initializer file (e.g. encoding_fix.rb) under your_app/config/initializers with the following content:
Encoding.default_internal = Encoding::UTF_8 if RUBY_VERSION > "1.9"
Encoding.default_external = Encoding::UTF_8 if RUBY_VERSION > "1.9"
This sets both the default internal and the default external encoding to UTF-8.
