"\x9D" to UTF-8 in conversion from Windows-1252 to UTF-8 - ruby-on-rails

I have created a CSV uploader in my Rails app, but sometimes I get this error:
"\x9D" to UTF-8 in conversion from Windows-1252 to UTF-8
This is the source of my uploader:
def self.import(file)
  CSV.foreach(file.path, headers: true, encoding: "windows-1252:utf-8") do |row|
    title = row[1]
    row[1] = title.to_ascii
    description = row[2]
    row[2] = description.to_ascii
    Event.create! row.to_hash
  end
end
I am using the unidecoder gem (https://github.com/norman/unidecoder) to normalize any goofy characters that a user may input. I've run into this error a few times but can't determine how to fix it. I thought the encoding: "windows-1252:utf-8" option would fix the problem, but it hasn't.
Thanks stack!

There is no 0x9D character (nor 0x81, 0x8D, 0x8F, 0x90) in Windows-1252. That means your text is not actually in Windows-1252 encoding; at the very least your source text is corrupt.
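If you cannot fix the file at its source, one workaround is to transcode the raw bytes yourself and replace anything that cannot be mapped, instead of letting CSV.foreach raise. A minimal sketch, reworking the import method from the question and assuming you are happy to silently drop the unmappable bytes:
require 'csv'

def self.import(file)
  # Read raw bytes, treat them as Windows-1252, and replace any byte that has
  # no mapping (such as 0x9D) instead of raising Encoding::UndefinedConversionError.
  raw  = File.read(file.path, mode: 'rb')
  utf8 = raw.encode('UTF-8', 'Windows-1252', invalid: :replace, undef: :replace, replace: '')

  CSV.parse(utf8, headers: true) do |row|
    Event.create! row.to_hash
  end
end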

I was running into this error from reading URL contents:
table = CSV.parse(URI.open(document.url).read)
It turns out the API I am fetching from conditionally returns GZIP-compressed content if the file is too large.
Another annoying thing is that the Rails decompression helper then failed with a UTF-8 encoding error of its own.
This did NOT work:
ActiveSupport::Gzip.decompress(URI.open(document.url).read)
This did work:
Zlib::GzipReader.wrap(URI.open(document.url), &:read)
My next problem was that CSV.parse() reads the entire blob at once, and I had a single line with errors.
downloaded_file = StringIO.new(Zlib::GzipReader.wrap(URI.open(document.url), &:read))
tempfile = Tempfile.new("open-uri", binmode: true)
IO.copy_stream(downloaded_file, tempfile.path)
headers = nil
File.foreach(tempfile.path) do |line|
  row = []
  if headers.blank?
    headers = CSV.parse_line(line, col_sep: "\t", liberal_parsing: true)
  else
    line_data = CSV.parse_line(line.force_encoding("UTF-8").encode("UTF-8", invalid: :replace, undef: :replace), col_sep: "\t", liberal_parsing: true)
    row = headers.zip(line_data)
  end
  puts row.inspect
  # ... do a lot more stuff
end
wow.
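For what it's worth, you can also decide whether to decompress at all by checking for the gzip magic bytes (0x1F 0x8B) at the start of the response, instead of always calling GzipReader. A rough sketch, reusing document.url from above:
require 'open-uri'
require 'zlib'
require 'csv'

io = URI.open(document.url)
magic = io.read(2).to_s
io.rewind                                  # put the peeked bytes back

body =
  if magic.b == "\x1F\x8B".b               # gzip magic bytes present
    Zlib::GzipReader.wrap(io, &:read)
  else
    io.read                                # plain response
  end

table = CSV.parse(body)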

Related

Invalid Byte Sequence in UTF-8 from Excel file

(Ruby 2.5) I have a method that reads and parses a csv file that's being uploaded via Alchemy CMS
def process_csv(csv_file, current_user_id, original_filename)
  lock_importer
  errors = []
  index = 0
  string_converter = lambda { |field| field.strip }
  total = CSV.foreach(csv_file, headers: true).count
  csv_string = csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)
  CSV.parse(csv_string, headers: true, header_converters: :symbol, skip_blanks: true, converters: [string_converter]) do |row|
    # do other stuff
  end
end
but when I try to upload a CSV file that has a column (name) with a string containing special characters, I receive the Invalid Byte Sequence in UTF-8 error. I'm testing with the value N'öt Réal Stô'rë.
I've tried a few solutions that I found on the web but had no luck - any suggestions?
It's unclear what your csv_file is. I guess it is a File object.
Sometimes I get CSV files from Excel in UTF-16. So let's try an example:
I have a CSV file stored in UTF-16BE with the following content:
line;comment;UmlautÄ
1;Das ist UTF-16 BE;Ä
2;öüäÖÄÜ;Ä
If I execute the following code:
require 'csv'
def process_csv(csv_file)
  csv_string = csv_file.read #.encode!("UTF-8", "iso-8859-1", invalid: :replace)
  CSV.parse(csv_string, headers: true, skip_blanks: true, col_sep: ';') do |row|
    p row # do other stuff
  end
end
process_csv(File.open('example_utf16BE.txt'))
then I also get an Invalid byte sequence in UTF-8 error.
If I use
process_csv(File.open('example_utf16BE.txt', 'rb', encoding: 'BOM|utf-16BE'))
then everything works.
So I guess you get a File object with the wrong encoding, and the code csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace) is an attempt to repair this problem.
What you can do:
Add to your code:
p csv_file
p csv_file.external_encoding
You should get
#<File:example_utf16BE.txt>
#<Encoding:UTF-16BE>
Now check whether the file (in my example: example_utf16BE.txt) really has the encoding shown in the second line.
If not, try to adapt how the File object is created.
If this is not possible, you can try csv_file.set_encoding 'utf-8' to change the encoding before you read the content.
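If you cannot control how the File object is created (which is likely when Alchemy hands you the upload), another option is to read the raw bytes yourself and sniff the BOM before parsing. A sketch, assuming you at least know the file's path (called path here):
require 'csv'

raw = File.binread(path)                             # raw bytes, no encoding assumed
csv_string =
  case raw[0, 2].bytes
  when [0xFE, 0xFF]                                  # UTF-16 big-endian BOM
    raw.force_encoding('UTF-16BE').encode('UTF-8', invalid: :replace, undef: :replace)
  when [0xFF, 0xFE]                                  # UTF-16 little-endian BOM
    raw.force_encoding('UTF-16LE').encode('UTF-8', invalid: :replace, undef: :replace)
  else
    raw.force_encoding('UTF-8').scrub                # drop invalid UTF-8 bytes, if any
  end
csv_string.sub!(/\A\uFEFF/, '')                      # remove the BOM character itself

CSV.parse(csv_string, headers: true, skip_blanks: true, col_sep: ';') do |row|
  p row
end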

Parse binary CSV file in Ruby

This should have been such an easy thing... but I can't for the life of me figure out how to parse a CSV file that doesn't seem to have a specific encoding.
File.open(Rails.root.join('data', 'mike/test-csv.csv'), 'rb') { |f| f.read }
=> "ID,\x00Q\x00u\x00a\x00n\x00t\x00i\x00t\x00y\n\x006\x00e\x005\x004\x009\x001\x00e\x007\x00-\x007\x00f\x001\x005\x00-\x004\x001\x007\x00d\x00-\x00a\x004\x000\x003\x00-345\x00,\x00\x005\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x00\n"
Here's a gist of it, can't figure out a way to post the specific CSV.
All I get from checking the encoding of the file is that it's in binary format. Any thoughts on how I could get it into a normal CSV?
Note: This is a downloaded CSV so converting it to another encoding via opening it in excel and exporting (or something like that) is not an option :)
Thanks!
Updating with attempted solution 1:
path = Rails.root.join('data', 'mike/test-csv.csv')
CSV.read(path, {:headers => true, :encoding => 'utf-8'}).each do |d|
  puts d
end
Result: 6e5491e7-7f15-417d-a403-345,50.00000000
While this is correct, it ONLY works with puts, for example:
CSV.read(path, {:headers => true, :encoding => 'utf-8'}).map { |row| row }
=> [#<CSV::Row "ID":"\u00006\u0000e\u00005\u00004\u00009\u00001\u0000e\u00007\u0000-\u00007\u0000f\u00001\u00005\u0000-\u00004\u00001\u00007\u0000d\u0000-\u0000a\u00004\u00000\u00003\u0000-345\u0000" "\u0000Q\u0000u\u0000a\u0000n\u0000t\u0000i\u0000t\u0000y":"\u0000\u00005\u00000\u0000.\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u0000">]
CSV.read(path, {:headers => true, :encoding => 'utf-8'}).map(&:to_s)
=> ["\u00006\u0000e\u00005\u00004\u00009\u00001\u0000e\u00007\u0000-\u00007\u0000f\u00001\u00005\u0000-\u00004\u00001\u00007\u0000d\u0000-\u0000a\u00004\u00000\u00003\u0000-345\u0000,\u0000\u00005\u00000\u0000.\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u0000\n"]
It's unfortunately still not the correct string :(
Final Solution (via #ashmaroli below):
path = Rails.root.join('data', 'mike/test-csv.csv')
csv_text = ''
File.open(path, 'r') do |csv|
  csv.each_line do |line|
    csv_text << line.gsub(/\u0000/, '')
  end
end
CSV.parse(csv_text, headers: true).map { |row| row }
Result:
[#<CSV::Row "ID":"6e5491e7-7f15-417d-a403-345" "Quantity":"50.00000000">]
Github Gist
Download Example CSV File
path = Rails.root.join('data', 'mike/test-csv.csv')
file = ""
File.open(path, 'r') do |csv|
  csv.each_line do |line|
    file << line.gsub(/\u0000/, '')
  end
end
print file
print file.inspect # same as above, just wraps the string in a
                   # single line with "\n" chars
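The dump above is essentially ASCII with stray NUL bytes interleaved (UTF-16-style), so stripping the NULs is the whole trick. A slightly shorter variant of the same idea (a sketch, not from the answer above) reads the bytes in one go and deletes the NULs before parsing:
require 'csv'

path = Rails.root.join('data', 'mike/test-csv.csv')
csv_text = File.binread(path).delete("\x00").force_encoding('UTF-8') # drop the NUL bytes
p CSV.parse(csv_text, headers: true).first
# => #<CSV::Row "ID":"6e5491e7-7f15-417d-a403-345" "Quantity":"50.00000000">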

Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

Ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error, which suggests the URLs are not valid? (They are all standard 'http://www.exmaple.com/page' URLs and work in my browser.)
I have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.
The code:
require 'csv'
require 'nokogiri'
require 'open-uri'

CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}

CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|
    URI.parse(URI.encode(url.chomp))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end
puts "scraping done!"
The error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'
Two things:
You say that the URLs are stored in a CSV file, but you reference an Excel file in your code, listOfURLs.xls.
The issue seems to be the encoding of the file listOfURLs.xls: Ruby assumes that the file is UTF-8 encoded. If the file is not UTF-8 encoded or contains invalid UTF-8 characters, you can get that error.
You should double check that the file is encoded in UTF-8 and doesn't contain any illegal characters.
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
f = File.foreach('listOfURLs.xls', {encoding: "iso-8859-1"}) do |row|
  puts row
end
Some good info about invalid byte sequences in UTF-8
Update:
An example:
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls', {encoding: "iso-8859-1"}) do |url|
    URI.parse(URI.encode(url.chomp))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end
I'm a bit late to the party here, but this should work for anyone running into the same issue in the future:
csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
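On Ruby 2.1 and later, String#scrub is an even shorter way to drop bytes that are invalid in the string's claimed encoding (an alternative sketch, not part of the answer above; 'listOfURLs.xls' is the file from the question):
# Read the file with the default external encoding (usually UTF-8)
# and remove any byte sequences that are not valid in that encoding.
csv_doc = IO.read('listOfURLs.xls').scrub('')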

Ruby CSV File Parsing, Headers won't format?

My rb file reads:
require "csv"
puts "Program1 initialized."
contents = CSV.open "data.csv", headers: true
contents.each do |row|
name = row[4]
puts name
end
...but when I run it in Ruby, it won't load the program. It gives me this error message about the headers:
syntax error, unexpected ':', expecting $end
contents = CSV.open "data.csv", headers: true
So I'm trying to figure out why Ruby won't let me parse this file. I've tried using other CSV files I have and they won't load either; I get an error message every time. I'm just trying to get the beginning of the program going! I feel like it has to do with the headers. I've updated as much as I can; mind you, I'm using Ruby 1.8.7. I read somewhere else that I could try to run the program in irb, but it didn't seem to need that. So yeah... thank you in advance!!!!
Since you are using Ruby 1.8.7, headers: true won't work this way: that shorthand hash syntax was introduced in Ruby 1.9 (which is why you get syntax error, unexpected ':'), and the CSV library shipped with 1.8 doesn't support a :headers option either.
The simplest way to ignore the headers and get your data is to shift the first row in the data, which would be the headers:
require 'csv'
contents = CSV.open("data.csv", 'r')
contents.shift
contents.each do |row|
  name = row[4]
  puts name
end
If you do want to use the headers syntax in Ruby 1.8, you would need to use FasterCSV, something similar to this:
require 'fastercsv'
FasterCSV.foreach("data.csv", :headers => true) do |fcsv_obj|
  puts fcsv_obj['name']
end
(Refer to this question for further reading: Parse CSV file with header fields as attributes for each row)
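For completeness: on Ruby 1.9 and later the standard library's CSV is FasterCSV, so the snippet from the question works essentially unchanged (a sketch assuming the same data.csv):
require 'csv'

CSV.foreach("data.csv", headers: true) do |row|
  puts row[4] # the fifth column, as in the question
end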

How to change the encoding during CSV parsing in Rails

I would like to know how I can change the encoding of my CSV file when I import and parse it. I have this code:
csv = CSV.parse(output, :headers => true, :col_sep => ";")
csv.each do |row|
  row = row.to_hash.with_indifferent_access
  insert_data_method(row)
end
When I read my file, I get this error:
Encoding::CompatibilityError in FileImportingController#load_file
incompatible character encodings: ASCII-8BIT and UTF-8
I read about row.force_encoding('utf-8') but it does not work:
NoMethodError in FileImportingController#load_file
undefined method `force_encoding' for #<ActiveSupport::HashWithIndifferentAccess:0x2905ad0>
Thanks.
I had to read CSV files encoded in ISO-8859-1.
Doing the documented
CSV.foreach(filename, encoding:'iso-8859-1:utf-8', col_sep: ';', headers: true) do |row|
threw the exception
ArgumentError: invalid byte sequence in UTF-8
from csv.rb:2027:in '=~'
from csv.rb:2027:in 'init_separators'
from csv.rb:1570:in 'initialize'
from csv.rb:1335:in 'new'
from csv.rb:1335:in 'open'
from csv.rb:1201:in 'foreach'
so I ended up reading the file and converting it to UTF-8 while reading, then parsing the string:
CSV.parse(File.open(filename, 'r:iso-8859-1:utf-8') { |f| f.read }, col_sep: ';', headers: true, header_converters: :symbol) do |row|
  pp row
end
force_encoding is meant to be run on a string, but it looks like you're calling it on a hash. You could say:
output.force_encoding('utf-8')
csv = CSV.parse(output, :headers => true, :col_sep => ";")
...
Hey I wrote a little blog post about what I did, but it's slightly more verbose than what's already been posted. For whatever reason, I couldn't get those solutions to work and this did.
The gist is that I simply replace (or in my case, remove) the invalid/undefined characters in my file, then rewrite it. I used this method to convert the files:
def convert_to_utf8_encoding(original_file)
  original_string = original_file.read
  final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '') # If you'd rather replace invalid characters with something else, do so here.
  final_file = Tempfile.new('import') # No need to save a real File
  final_file.write(final_string)
  final_file.close # Don't forget me
  final_file
end
Hope this helps.
Edit: No destination encoding is specified here because encode defaults to your default internal encoding (Encoding.default_internal), which for most Rails applications is UTF-8 (I believe).
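A usage sketch of the helper above, wired into the parsing loop from the question (params[:file] as the uploaded file is an assumption):
utf8_file = convert_to_utf8_encoding(params[:file]) # hypothetical upload param
output = File.read(utf8_file.path)                  # the Tempfile is closed but still on disk

csv = CSV.parse(output, :headers => true, :col_sep => ";")
csv.each do |row|
  row = row.to_hash.with_indifferent_access
  insert_data_method(row)
end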
