Encoding::UndefinedConversionError: "\xE4" from ASCII-8BIT to UTF-8 - ruby-on-rails

I tried to fetch this CSV file with Net::HTTP:
File.open(file, "w:UTF-8") do |f|
  content = Net::HTTP.get_response(URI.parse(url)).body
  f.write(content)
end
After reading my local CSV file back in, I got some weird output:
Nationalit\xE4t;Alter 0-5
I tried to encode it to UTF-8, but got the error Encoding::UndefinedConversionError: "\xE4" from ASCII-8BIT to UTF-8
The rchardet gem tells me the content is ISO-8859-2, but converting it to UTF-8 does not work.
When I open the file in a normal text editor, it displays correctly.

You can go with force_encoding:
require 'net/http'
url = "http://data.linz.gv.at/katalog/population/abstammung/2012/auslg_2012.csv"
File.open('output', "w:UTF-8") do |f|
  content = Net::HTTP.get_response(URI.parse(url)).body
  f.write(content.force_encoding("UTF-8"))
end
But this will make you lose some accented characters in your .csv file, because force_encoding only relabels the bytes as UTF-8 without converting them.
If you are absolutely sure that you will always use this URL as input, and that the file will always keep this encoding, you can do:
# encoding: utf-8
require 'net/http'
url = "http://data.linz.gv.at/katalog/population/abstammung/2012/auslg_2012.csv"
File.open('output', "w:UTF-8") do |f|
  content = Net::HTTP.get_response(URI.parse(url)).body
  f.write(content.encode("UTF-8", "ISO-8859-15"))
end
But this will only work for this file.
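To see the difference between the two approaches without touching the network, here is a minimal sketch. The sample bytes come from the output quoted in the question; the helper names relabel_only and transcode are made up for illustration, and ISO-8859-15 is assumed as the source encoding, as in the answer above.

```ruby
# Bytes as Net::HTTP hands them back: tagged ASCII-8BIT (binary).
RAW_HEADER = "Nationalit\xE4t;Alter 0-5".b

# Hypothetical helper: only changes the encoding label, the bytes stay as they are.
def relabel_only(bytes)
  bytes.dup.force_encoding("UTF-8")
end

# Hypothetical helper: declares the assumed source encoding, then converts to UTF-8.
def transcode(bytes, source_encoding = "ISO-8859-15")
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# The lone \xE4 byte is not a valid UTF-8 sequence, so the relabeled string is broken.
puts relabel_only(RAW_HEADER).valid_encoding?
# Transcoding turns the single byte \xE4 into the two-byte UTF-8 form of "ä".
puts transcode(RAW_HEADER)
```

If the source encoding is guessed wrong, transcode still succeeds for most single-byte encodings but may produce the wrong characters, so it is worth checking a few known words in the output.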

Related

Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error, which seems to suggest the URLs are not valid? (They are all standard 'http://www.exmaple.com/page' URLs and work in my browser.)
Have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.
The code:
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
:write_headers => true,
:headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
csv_doc = File.foreach('listOfURLs.xls') do |url|
URI.parse(URI.encode(url.chomp))
begin
page = Nokogiri.HTML(open(url))
page.css('.bio media-content').each do |scrape|
desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
csv << [url, desc]
end
end
end
end
puts "scraping done!"
The error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'
Two things:
You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.
The issue seems to be the encoding of the file listOfURLs.xls. Ruby assumes that the file is UTF-8 encoded. If the file is not UTF-8 encoded, or contains invalid UTF-8 byte sequences, you can get that error.
You should double-check that the file is encoded in UTF-8 and doesn't contain any illegal characters.
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |row|
  puts row
end
Some good info about invalid byte sequences in UTF-8
Update:
An example:
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  # Pass the encoding as a keyword argument so it also works on Ruby 3.
  File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |url|
    URI.parse(URI.encode(url.chomp))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end
I'm a bit late to the party here, but this should work for anyone running into the same issue in the future:
csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
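As a variation on the answers above, the file can also be transcoded while it is read by giving both an external and an internal encoding. This is only a sketch: the helper name read_latin1_lines and the sample URL are made up, and it assumes the input really is ISO-8859-1.

```ruby
require 'tempfile'

# Hypothetical helper: read a Latin-1 file line by line and get UTF-8 strings back.
def read_latin1_lines(path)
  lines = []
  # "iso-8859-1:utf-8" = external encoding ISO-8859-1, transcoded to UTF-8 on read.
  File.foreach(path, encoding: "iso-8859-1:utf-8") { |line| lines << line.chomp }
  lines
end

# Made-up sample data: one URL containing the Latin-1 byte for "é".
SAMPLE_URLS = Tempfile.new('urls')
SAMPLE_URLS.binmode
SAMPLE_URLS.write("http://www.example.com/caf\xE9\n".b)
SAMPLE_URLS.close

read_latin1_lines(SAMPLE_URLS.path).each { |url| puts url }
```

Because every string that comes out is already valid UTF-8, later calls that run regular expressions over the line should no longer hit the invalid byte sequence error for this reason.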

Rails incompatible character encodings: UTF-8 and ASCII-8BIT

In my Rails app, I send a POST request:
require 'net/http'
url="http://192.168.0.84:809/Services/SDService.asmx/UserRegister"
Net::HTTP.post_form(URI(url),{:memtyp=>'CU',:memid=>'100867',:dob=>'1989-01-01'}).body
But I got the error:
incompatible character encodings: UTF-8 and ASCII-8BIT
I found that the error occurs when the response data includes UTF-8 characters such as 中文.
So what should I do?
You can send the data in JSON format. To do so, use something like the following:
require 'rest_client'
require "net/http"
require "uri"
require 'json'
RestClient.post 'localhost:3001/users',{:memtyp=>'CU',:memid=>'100867',:dob=>'1989-01-01'}.to_json , :content_type => :json, :accept => :json
Please change the localhost URL to the actual URL you want to hit.
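For context on the error itself: Net::HTTP returns the body tagged as ASCII-8BIT, and Ruby refuses to combine such a string with a UTF-8 string when both contain non-ASCII bytes. The sketch below reproduces that and shows one possible workaround. It assumes the service really sends UTF-8 bytes; the helper name utf8_body and the stand-in body are made up for illustration.

```ruby
# Hypothetical helper: relabel a binary response body as UTF-8 without changing its bytes.
def utf8_body(body)
  body.dup.force_encoding("UTF-8")
end

# Stand-in for a Net::HTTP body: the UTF-8 bytes of "中文", tagged ASCII-8BIT.
BINARY_BODY = "\xE4\xB8\xAD\xE6\x96\x87".b

begin
  # A UTF-8 string with non-ASCII characters plus a binary string with high bytes.
  "result: 中文 " + BINARY_BODY
rescue Encoding::CompatibilityError => e
  puts "mixing fails: #{e.message}"
end

# After relabeling, both sides are UTF-8 and can be combined.
puts "result: " + utf8_body(BINARY_BODY)
```

If the service used a different charset, relabeling would produce an invalid string, so checking valid_encoding? on the result is a cheap safeguard.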

"\xC2" to UTF-8 in conversion from ASCII-8BIT to UTF-8

I have a rails project that runs fine with MRI 1.9.3. When I try to run with Rubinius I get this error in app/views/layouts/application.html.haml:
"\xC2" to UTF-8 in conversion from ASCII-8BIT to UTF-8
It turns out the page had an invalid character (an interpunct '·'), which I found out with the following code (credits to this gist and this question):
lines = IO.readlines("app/views/layouts/application.html.haml").map do |line|
  line.force_encoding('ASCII-8BIT').encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
end
File.open("app/views/layouts/application.html.haml", "w") do |file|
  file.puts(lines)
end
After running this code, I could find the problematic characters with a simple git diff and moved the code to a helper file with # encoding: utf-8 at the top.
I'm not sure why this doesn't fail with MRI; it should, since I'm not specifying the encoding of the HAML file.
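A less invasive way to locate such characters is to list the offending lines without rewriting the file. The following is a sketch only: the helper name non_ascii_lines and the sample template content are made up, and it simply flags any line containing bytes outside the ASCII range.

```ruby
require 'tempfile'

# Hypothetical helper: return [line_number, line] pairs for lines with non-ASCII bytes.
def non_ascii_lines(path)
  File.binread(path)
      .each_line
      .with_index(1)
      .reject { |line, _number| line.ascii_only? }
      .map { |line, number| [number, line.chomp] }
end

# Made-up sample template: the second line contains an interpunct (bytes C2 B7).
SAMPLE_TEMPLATE = Tempfile.new('layout')
SAMPLE_TEMPLATE.binmode
SAMPLE_TEMPLATE.write("%body\n  = yield \xC2\xB7 footer\n".b)
SAMPLE_TEMPLATE.close

non_ascii_lines(SAMPLE_TEMPLATE.path).each do |number, line|
  puts "line #{number}: #{line.inspect}"
end
```

Reading with File.binread avoids any decoding step, so the check itself cannot raise an encoding error.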

Generating a PDF With Images from Base64 with Prawn

I am trying to save multiple PNGs in one PDF. I'm receiving the PNGs from an API call to the Endicia Label Server, which gives me a Base64-encoded image as its response.
Based on this question:
How to convert base64 string to PNG using Prawn without saving on server in Rails
def batch_order_labels
  @orders = Spree::Order.ready_to_ship.limit(1)
  dt = Date.current.strftime("%d %b %Y ")
  title = "Labels - #{dt} - #{@orders.count} Orders"
  Prawn::Document.generate("#{title}.pdf") do |pdf|
    @orders.each do |order|
      label = order.generate_label
      if order.international?
        @image = label.response_body.scan(/<Image PartNumber=\"1\">([^<>]*)<\/Image>/imu).flatten.last
      else
        @image = label.image
      end
      file = Tempfile.new('labelimg', :encoding => 'utf-8')
      file.write Base64.decode64(@image)
      file.close
      pdf.image file
      pdf.start_new_page
    end
  end
  send_data("#{title}.pdf")
end
But I'm receiving the following error:
"\x89" from ASCII-8BIT to UTF-8
Any idea?
There's no need to write the image data to a tempfile; Prawn::Document#image can accept a StringIO.
Try replacing this:
file = Tempfile.new('labelimg', :encoding => 'utf-8')
file.write Base64.decode64(@image)
file.close
pdf.image file
With this:
require 'stringio'
.....
image_data = StringIO.new( Base64.decode64(@image) )
pdf.image(image_data)
The problem is that the API returns this in UTF-8, so I don't have much choice.
Anyhow, I found this solution to work:
file = Tempfile.new('labelimg', :encoding => 'utf-8')
File.open(file, 'wb') do |f|
  f.write Base64.decode64(@image)
end
You can't convert the decoded Base64 data to UTF-8.
Leave it as plain ASCII:
file = Tempfile.new('labelimg', :encoding => 'ascii-8bit')
file.write Base64.decode64(@image)
file.close
or even better - leave it as binary:
file = Tempfile.new('labelimg')
file.write Base64.decode64(@image)
file.close
UTF-8 is a multibyte text encoding and is not suitable for carrying binary data such as images.
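The point about keeping the data binary can be checked without Prawn or the label API. In this sketch the helper name write_binary_tempfile is made up, and the eight-byte PNG file signature stands in for a real label image.

```ruby
require 'base64'
require 'tempfile'

# Hypothetical helper: decode a Base64 payload and store the raw bytes in a tempfile.
def write_binary_tempfile(base64_payload)
  file = Tempfile.new('labelimg')
  file.binmode # no transcoding and no newline conversion on write
  file.write(Base64.decode64(base64_payload))
  file.close
  file
end

# Stand-in data: the eight-byte signature that starts every PNG file.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b
SAMPLE_PAYLOAD = Base64.strict_encode64(PNG_SIGNATURE)

label_file = write_binary_tempfile(SAMPLE_PAYLOAD)
# Decoded Base64 data is tagged as binary, not as UTF-8 text.
puts Base64.decode64(SAMPLE_PAYLOAD).encoding
# The bytes on disk are identical to the original bytes.
puts File.binread(label_file.path) == PNG_SIGNATURE
```

The \x89 byte in the error message from the question is the first byte of this signature, which is why the failure shows up as soon as any PNG is written through a UTF-8 text stream.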

How to change the encoding during CSV parsing in Rails

I would like to know how I can change the encoding of my CSV file when I import and parse it. I have this code:
csv = CSV.parse(output, :headers => true, :col_sep => ";")
csv.each do |row|
row = row.to_hash.with_indifferent_access
insert_data_method(row)
end
When I read my file, I get this error:
Encoding::CompatibilityError in FileImportingController#load_file
incompatible character encodings: ASCII-8BIT and UTF-8
I read about row.force_encoding('utf-8') but it does not work:
NoMethodError in FileImportingController#load_file
undefined method `force_encoding' for #<ActiveSupport::HashWithIndifferentAccess:0x2905ad0>
Thanks.
I had to read CSV files encoded in ISO-8859-1.
Doing the documented
CSV.foreach(filename, encoding:'iso-8859-1:utf-8', col_sep: ';', headers: true) do |row|
threw the exception
ArgumentError: invalid byte sequence in UTF-8
from csv.rb:2027:in '=~'
from csv.rb:2027:in 'init_separators'
from csv.rb:1570:in 'initialize'
from csv.rb:1335:in 'new'
from csv.rb:1335:in 'open'
from csv.rb:1201:in 'foreach'
so I ended up reading the file and converting it to UTF-8 while reading, then parsing the string:
CSV.parse(File.open(filename, 'r:iso-8859-1:utf-8'){|f| f.read}, col_sep: ';', headers: true, header_converters: :symbol) do |row|
  pp row
end
force_encoding is meant to be run on a string, but it looks like you're calling it on a hash. You could say:
output.force_encoding('utf-8')
csv = CSV.parse(output, :headers => true, :col_sep => ";")
...
Hey I wrote a little blog post about what I did, but it's slightly more verbose than what's already been posted. For whatever reason, I couldn't get those solutions to work and this did.
The gist is that I simply replace (or in my case, remove) the invalid/undefined characters in my file, then rewrite it. I used this method to convert the files:
def convert_to_utf8_encoding(original_file)
  original_string = original_file.read
  final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '') # If you'd rather invalid characters be replaced with something else, do so here.
  final_file = Tempfile.new('import') # No need to save a real File
  final_file.write(final_string)
  final_file.close # Don't forget me
  final_file
end
Hope this helps.
Edit: No destination encoding is specified here because encode assumes that you're encoding to your default encoding which for most Rails applications is UTF-8 (I believe)
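When the source encoding is known, another option is to transcode the raw string first and only then hand it to CSV.parse, so that no characters are dropped. This is a sketch under the assumption that the input is ISO-8859-1; the helper name parse_latin1_csv and the sample rows are made up.

```ruby
require 'csv'

# Hypothetical helper: transcode Latin-1 bytes to UTF-8, then parse them as CSV.
def parse_latin1_csv(bytes)
  text = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
  CSV.parse(text, headers: true, col_sep: ";")
end

# Made-up sample data: \xE4 is "ä" and \xD6 is "Ö" in ISO-8859-1.
SAMPLE_CSV = "Nationalit\xE4t;Alter 0-5\n\xD6sterreich;42\n".b

parse_latin1_csv(SAMPLE_CSV).each do |row|
  puts row.to_h.inspect
end
```

Each row can then be converted with to_hash.with_indifferent_access as in the question, since every header and field is already a UTF-8 string.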
