Ruby string force encoding - ruby-on-rails

How can I force encode this: Al-F\u0026#257;ti\u0026#293;ah to Al-Fātiĥah
I tried .encode!('UTF-16', :undef => :replace, :invalid => :replace, :replace => "") and force_encoding("UTF-8") with no luck

That text seems to include HTML or XML entities.
Try
require "cgi/util"
CGI.unescapeHTML("Al-F\u0026#257;ti\u0026#293;ah")
or
# gem install nokogiri
require "nokogiri"
Nokogiri::XML.fragment("Al-F\u0026#257;ti\u0026#293;ah").text
See: Converting escaped XML entities back into UTF-8

Related

How to decode utf8 characters with ruby Base64

Just as the title describes, how can I do the following?
require 'base64'
text = 'éééé'
encode = Base64.encode64(text)
Base64.decode64(encode)
Result: éééé instead of \xC3\xA9\xC3\xA9
When you decode64 you get back a string with BINARY (a.k.a. ASCII-8BIT) encoding:
Base64.decode64(encode).encoding
# => #<Encoding:ASCII-8BIT>
The trick is to force-apply a particular encoding:
Base64.decode64(encode).force_encoding('UTF-8')
# => "éééé"
This assumes that your string is valid UTF-8, which it might not be, so use with caution.
Just use Base64's encode and decode method:
require 'base64'
=> true
Base64.encode64('aksdfjd')
=> "YWtzZGZqZA==\n"
Base64.decode64 "YWtzZGZqZA==\n"
=> "aksdfjd"

Rails 4.2 - how to fix ascii code in CSV exporting without gem 'iconv'?

When exporting csv in Rails 4.2 app, there are ascii code in the csv output for Chinese characters (UTF8):
中åˆåŒç†Šå·¥ç­‰ç”¨é¤
We tried options in send_data without luck:
send_data #payment_requests.to_csv, :type => 'text/csv; charset=utf-8; header=present'
And:
send_data #payment_requests.to_csv.force_encoding("UTF-8")
In model, there is forced encoding utf8:
# encoding: utf-8
But it does not work. There are online posts talking about use gem iconv. However iconv depends on the platform's ruby version. Is there cleaner solution to fix the ascii in Rails 4.2 csv exporting?
If #payment_requests.to_csv includes ASCII text, then you should use encode method:
#payment_requests.to_csv.encode("UTF-8")
or
#payment_requests.to_csv.force_encoding("ASCII").encode("UTF-8")
depending on which internal encoding #payment_requests.to_csv has.
You can try:
#payment_requests.to_csv.force_encoding("ISO-8859-1")
for Chinese characters
CSV.generate(options) do |csv|
csv << column_names
all.each do |product|
csv << product.attributes.values_at(*column_names)
end
end.encode('gb2312', :invalid => :replace, :undef => :replace, :replace => "?")
This is what worked for me:
head = 'EF BB BF'.split(' ').map{|a|a.hex.chr}.join()
csv_str = CSV.generate(csv = head) do |csv|
csv << [ , , , ...]
#elements.each do |element|
csv << [ , , , ...]
end
end

"\xC2" to UTF-8 in conversion from ASCII-8BIT to UTF-8

I have a rails project that runs fine with MRI 1.9.3. When I try to run with Rubinius I get this error in app/views/layouts/application.html.haml:
"\xC2" to UTF-8 in conversion from ASCII-8BIT to UTF-8
It turns out the page had an invalid character (an interpunct '·'), which I found out with the following code (credits to this gist and this question):
lines = IO.readlines("app/views/layouts/application.html.haml").map do |line|
line.force_encoding('ASCII-8BIT').encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
end
File.open("app/views/layouts/application.html.haml", "w") do |file|
file.puts(lines)
end
After running this code, I could find the problematic characters with a simple git diff and moved the code to a helper file with # encoding: utf-8 at the top.
I'm not sure why this doesn't fail with MRI but it should since I'm not specifying the encoding of the haml file.

UTF-8 conversion not working with String#encode but Iconv

I had this with Iconv:
git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log
Now I want to change it to use String#encode due to deprecation warnings, but I can't, doesn't work:
git_log = git_log.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')
I used to use Iconv here, and it's still working:
https://github.com/gamersmafia/gamersmafia/blob/master/lib/formatting.rb#L244
But when I replace these line with String#encode method, first gsub raises a "invalid byte sequence in UTF-8" error.
Do you know why?
In your call to String#encode you don’t specify a source encoding. Ruby is using the strings current encoding as the source, which appears to be UTF-8, and according to the docs:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
In other words the call has no effect, and leaves the bytes in the string as they are, encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.
String#encode has a a form that accepts the source encoding as the second parameter, so you can explicitly specify it, similarly to what you are doing with Iconv. Try this:
git_log = git_log.encode(Encoding::UTF_8,
Encoding::ISO_8859_1,
:invalid => :replace,
:undef => :replace,
:replace => '')
You could also use the ! form in this case, which has the same effect:
git_log.encode!(Encoding::UTF_8,
Encoding::ISO_8859_1,
:invalid => :replace,
:undef => :replace,
:replace => '')
Try the following approach, which removes a character from a string if the character is mal-encoded:
invalid_character_indices = []
mystring.each_char.with_index do |char, i|
invalid_character_indices << i unless char == char.encode(Encoding::UTF_8, Encoding::ISO_8859_1,:invalid => :replace, :undef => :replace, :replace => "")
end
invalid_character_indices.each do |i|
mystring.delete!(mystring[i])
end

How to change the encoding during CSV parsing in Rails

I would like to know how can I change the encoding of my CSV file when I import it and parse it. I have this code:
csv = CSV.parse(output, :headers => true, :col_sep => ";")
csv.each do |row|
row = row.to_hash.with_indifferent_access
insert_data_method(row)
end
When I read my file, I get this error:
Encoding::CompatibilityError in FileImportingController#load_file
incompatible character encodings: ASCII-8BIT and UTF-8
I read about row.force_encoding('utf-8') but it does not work:
NoMethodError in FileImportingController#load_file
undefined method `force_encoding' for #<ActiveSupport::HashWithIndifferentAccess:0x2905ad0>
Thanks.
I had to read CSV files encoded in ISO-8859-1.
Doing the documented
CSV.foreach(filename, encoding:'iso-8859-1:utf-8', col_sep: ';', headers: true) do |row|
threw the exception
ArgumentError: invalid byte sequence in UTF-8
from csv.rb:2027:in '=~'
from csv.rb:2027:in 'init_separators'
from csv.rb:1570:in 'initialize'
from csv.rb:1335:in 'new'
from csv.rb:1335:in 'open'
from csv.rb:1201:in 'foreach'
so I ended up reading the file and converting it to UTF-8 while reading, then parsing the string:
CSV.parse(File.open(filename, 'r:iso-8859-1:utf-8'){|f| f.read}, col_sep: ';', headers: true, header_converters: :symbol) do |row|
pp row
end
force_encoding is meant to be run on a string, but it looks like you're calling it on a hash. You could say:
output.force_encoding('utf-8')
csv = CSV.parse(output, :headers => true, :col_sep => ";")
...
Hey I wrote a little blog post about what I did, but it's slightly more verbose than what's already been posted. For whatever reason, I couldn't get those solutions to work and this did.
This gist is that I simply replace (or in my case, remove) the invalid/undefined characters in my file then rewrite it. I used this method to convert the files:
def convert_to_utf8_encoding(original_file)
original_string = original_file.read
final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '') #If you'd rather invalid characters be replaced with something else, do so here.
final_file = Tempfile.new('import') #No need to save a real File
final_file.write(final_string)
final_file.close #Don't forget me
final_file
end
Hope this helps.
Edit: No destination encoding is specified here because encode assumes that you're encoding to your default encoding which for most Rails applications is UTF-8 (I believe)

Resources